msg89058 - (view) |
Author: Neil Muller (Neil Muller) |
Date: 2009-06-07 21:30 |
In py3k, ElementTree no longer correctly converts characters to entities
when they can't be represented in the requested output encoding.
Python 2:
>>> import xml.etree.ElementTree as ET
>>> e = ET.XML("<?xml version='1.0'
encoding='iso-8859-1'?><body>t\xe3t</body>")
>>> ET.tostring(e, 'ascii')
"<?xml version='1.0' encoding='ascii'?>\n<body>tãt</body>"
Python 3:
>>> import xml.etree.ElementTree as ET
>>> e = ET.XML("<?xml version='1.0'
encoding='iso-8859-1'?><body>t\xe3t</body>")
>>> ET.tostring(e, 'ascii')
.....
UnicodeEncodeError: 'ascii' codec can't encode characters in position
1-2: ordinal not in range(128)
It looks like _encode_entity isn't ever called inside ElementTree
anymore - it probably should be called as part of _encode for characters
that can't be represented.
|
msg89059 - (view) |
Author: Neil Muller (Neil Muller) |
Date: 2009-06-07 21:47 |
Simple possible patch uploaded
This doesn't give the expected answer for the test above, but does work
when starting from an XML file in utf-8 encoding. I still need to
determine why this happens.
|
msg89276 - (view) |
Author: Neil Muller (Neil Muller) |
Date: 2009-06-12 12:50 |
> This doesn't give the expected answer for the test above
Which is obviously due to not comparing apples with apples, as I should
be using a byte-string in the py3k example.
>>> import xml.etree.ElementTree as ET
>>> e = ET.XML(b"<?xml version='1.0'
encoding='iso-8859-1'?><body>t\xe3t</body>")
>>> ET.tostring(e, 'ascii')
Fails without the patch, behaves as expected with the patch.
|
msg89504 - (view) |
Author: Neil Muller (Neil Muller) |
Date: 2009-06-18 13:12 |
Updated patch - adds a test for this.
|
msg89505 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-06-18 13:25 |
This regression is probably annoying enough to make it a blocker.
|
msg89583 - (view) |
Author: Fredrik Lundh (effbot) *  |
Date: 2009-06-21 21:33 |
Umm. Isn't _encode used to encode tags and attribute names? The charref
syntax is only valid in CDATA sections and attribute values, which are
encoded by the corresponding _escape functions. I suspect this patch will
make things blow up on a non-ASCII tag/attribute name.
|
msg89585 - (view) |
Author: Fredrik Lundh (effbot) *  |
Date: 2009-06-21 21:42 |
Did you look at the 1.3 alpha code base when you came up with this idea?
Unfortunately, 1.3's _encode is used for a different purpose...
I don't have time to test it tonight, but I suspect that 1.3's
escape_data/escape_attrib functions might work better under 3.X; they do
the text.replace dance first, and then an explicit text.encode(encoding,
"xmlcharrefreplace") at the end. E.g.
def _escape_cdata(text, encoding):
# escape character data
try:
# it's worth avoiding do-nothing calls for strings that are
# shorter than 500 character, or so. assume that's, by far,
# the most common case in most applications.
if "&" in text:
text = text.replace("&", "&")
if "<" in text:
text = text.replace("<", "<")
if ">" in text:
text = text.replace(">", ">")
return text.encode(encoding, "xmlcharrefreplace")
except (TypeError, AttributeError):
_raise_serialization_error(text)
|
msg89623 - (view) |
Author: Jerry Chen (jcsalterego) |
Date: 2009-06-23 05:37 |
The attached patch includes Neil's original additions to test_xml_etree.py.
I also noticed that _encode_entity wasn't being called in ElementTree in
py3k, with the important bit being the nested function
escape_entities(), in conjunction with _escape and _escape_map.
In 2.x, _encode_entity() is used after _encode() throws Unicode
exceptions [1], so I figured it would make sense to take the core
functionality of _escape_entities() and integrate it into _encode in the
same fashion -- when an exception is thrown.
Basically, I:
- changed _escape regexp from using "[\x0080-\uffff]" to "[\x80-xff]"
- extracted _encode_entity.escape_entities() and made it
_escape_entities of module scope
- removed _encode_entity()
- added UnicodeEncodeError exception in _encode()
I'm not sure what the expected outcome is supposed to be when the text
is not type bytes but str. With this patch, the output has
b"tãt" rather than b"tãt".
Hope this is a step in the right direction.
[1] ElementTree.py:814, ElementTree.py:829, python 2.7 HEAD r50941
|
msg89684 - (view) |
Author: Fredrik Lundh (effbot) *  |
Date: 2009-06-24 21:51 |
That's backwards, unless I'm missing something here: charrefs represent
Unicode characters, not UTF-8 byte values. The character "LATIN SMALL
LETTER A WITH TILDE" with the character value 227 should be represented as
"ã" if serialized to an encoding that doesn't support non-ASCII
characters.
And there's no need to use RE:s to filter things under 3.X; those parts of
ET 1.2 are there for pre-2.0 compatibility.
Did you try running the tests with the escape function I posted?
|
msg89690 - (view) |
Author: Jerry Chen (jcsalterego) |
Date: 2009-06-24 23:56 |
Thanks for the explanation -- looks like I was way off base on that one.
I took a look at the code you provided but it doesn't work as a drop-in
replacement for _escape_cdata, since that function returns a string
rather than bytes.
However taking your code, calling it _encode_cdata and then refactoring
all calls _encode(_escape_cdata(x), encoding) to _encode_cdata(x,
encoding) seems to do the trick and passes the tests.
Specific example:
- file.write(_encode(_escape_cdata(node.text), encoding))
+ file.write(_encode_cdata(node.text, encoding))
One minor modification is to return the string as is if encoding=None,
just like _encode:
def _encode_cdata(text, encoding):
# escape character data
try:
text = text.replace("&", "&")
text = text.replace("<", "<")
text = text.replace(">", ">")
if encoding:
return text.encode(encoding, "xmlcharrefreplace")
else:
return text
except (TypeError, AttributeError):
_raise_serialization_error(text)
|
msg89715 - (view) |
Author: Benjamin Peterson (benjamin.peterson) *  |
Date: 2009-06-25 21:26 |
effbot, do you have an opinion about the latest patch? It'd be nice to
not have to delay the release for this.
|
msg89718 - (view) |
Author: Martin v. Löwis (loewis) *  |
Date: 2009-06-26 01:12 |
I disagree with this report being classified as release-critical - it is
*not* a regression over 3.0 (i.e. 3.0 already behaved in the same way).
That it is a regression relative to 2.x should not make it
release-critical - we can still fix such regressions in 3.2.
In addition, there is an easy work-around for applications that run into
the problem - just use utf-8 as the output encoding always:
py> e = ET.XML(b"<?xml version='1.0'
encoding='iso-8859-1'?><body>t\xe3t</body>")
py> ET.tostring(e,encoding='utf-8')
b'<body>t\xc3\xa3t</body>'
|
msg89722 - (view) |
Author: Raymond Hettinger (rhettinger) *  |
Date: 2009-06-26 03:30 |
+1 for Py3.1.1
|
msg89728 - (view) |
Author: Jerry Chen (jcsalterego) |
Date: 2009-06-26 14:21 |
Either way, it would be nice to get feedback so we can iterate on the
patch or close out this issue already :-)
|
msg95045 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-11-08 20:43 |
The patch looks ok to me.
|
msg99125 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2010-02-09 16:56 |
Committed in r78123 (py3k) and r78124 (3.1). I've also removed _escape_cdata() since it wasn't used anymore. Thanks Jerry for the patch.
|
|
Date |
User |
Action |
Args |
2022-04-11 14:56:49 | admin | set | github: 50482 |
2010-02-09 16:56:25 | pitrou | set | status: open -> closed resolution: fixed messages:
+ msg99125
stage: resolved |
2009-11-08 20:43:14 | pitrou | set | messages:
+ msg95045 |
2009-06-26 14:21:19 | jcsalterego | set | messages:
+ msg89728 |
2009-06-26 11:13:39 | pitrou | set | priority: release blocker -> critical versions:
+ Python 3.2, - Python 3.0 |
2009-06-26 03:30:55 | rhettinger | set | nosy:
+ rhettinger messages:
+ msg89722
|
2009-06-26 01:12:28 | loewis | set | nosy:
+ loewis messages:
+ msg89718
|
2009-06-25 21:26:50 | benjamin.peterson | set | nosy:
+ benjamin.peterson messages:
+ msg89715
|
2009-06-24 23:56:04 | jcsalterego | set | files:
+ issue6233-encode_cdata.diff
messages:
+ msg89690 |
2009-06-24 21:51:25 | effbot | set | messages:
+ msg89684 |
2009-06-23 05:37:25 | jcsalterego | set | files:
+ issue6233-escape_entities.diff nosy:
+ jcsalterego messages:
+ msg89623
|
2009-06-21 21:42:03 | effbot | set | messages:
+ msg89585 |
2009-06-21 21:33:01 | effbot | set | messages:
+ msg89583 |
2009-06-18 13:25:14 | pitrou | set | priority: release blocker nosy:
+ pitrou messages:
+ msg89505
|
2009-06-18 13:12:25 | Neil Muller | set | files:
+ issue6233_py3k_with_test.diff
messages:
+ msg89504 |
2009-06-12 12:50:24 | Neil Muller | set | messages:
+ msg89276 |
2009-06-07 21:47:42 | Neil Muller | set | files:
+ issue6233_py3k.diff keywords:
+ patch messages:
+ msg89059
|
2009-06-07 21:30:58 | Neil Muller | create | |