Issue 6233: ElementTree (py3k) doesn't properly encode characters that can't be represented in the specified encoding

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/50482

classification

Title:	ElementTree (py3k) doesn't properly encode characters that can't be represented in the specified encoding
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.1, Python 3.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	Neil Muller, benjamin.peterson, effbot, hodgestar, jcsalterego, loewis, pitrou, rhettinger
Priority:	critical	Keywords:	patch

Created on 2009-06-07 21:30 by Neil Muller, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
issue6233_py3k.diff	Neil Muller, 2009-06-07 21:47	Simple patch
issue6233_py3k_with_test.diff	Neil Muller, 2009-06-18 13:12	Combined patch with test case
issue6233-escape_entities.diff	jcsalterego, 2009-06-23 05:37	Excp handling in _encode + prev submitted test
issue6233-encode_cdata.diff	jcsalterego, 2009-06-24 23:56	_encode & effbot's _escape_cdata, w/ test

Messages (16)
msg89058 - (view)	Author: Neil Muller (Neil Muller)	Date: 2009-06-07 21:30
In py3k, ElementTree no longer correctly converts characters to entities when they can't be represented in the requested output encoding. Python 2: >>> import xml.etree.ElementTree as ET >>> e = ET.XML("<?xml version='1.0' encoding='iso-8859-1'?><body>t\xe3t</body>") >>> ET.tostring(e, 'ascii') "<?xml version='1.0' encoding='ascii'?>\n<body>tãt</body>" Python 3: >>> import xml.etree.ElementTree as ET >>> e = ET.XML("<?xml version='1.0' encoding='iso-8859-1'?><body>t\xe3t</body>") >>> ET.tostring(e, 'ascii') ..... UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128) It looks like _encode_entity isn't ever called inside ElementTree anymore - it probably should be called as part of _encode for characters that can't be represented.
msg89059 - (view)	Author: Neil Muller (Neil Muller)	Date: 2009-06-07 21:47
Simple possible patch uploaded This doesn't give the expected answer for the test above, but does work when starting from an XML file in utf-8 encoding. I still need to determine why this happens.
msg89276 - (view)	Author: Neil Muller (Neil Muller)	Date: 2009-06-12 12:50
> This doesn't give the expected answer for the test above Which is obviously due to not comparing apples with apples, as I should be using a byte-string in the py3k example. >>> import xml.etree.ElementTree as ET >>> e = ET.XML(b"<?xml version='1.0' encoding='iso-8859-1'?><body>t\xe3t</body>") >>> ET.tostring(e, 'ascii') Fails without the patch, behaves as expected with the patch.
msg89504 - (view)	Author: Neil Muller (Neil Muller)	Date: 2009-06-18 13:12
Updated patch - adds a test for this.
msg89505 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-06-18 13:25
This regression is probably annoying enough to make it a blocker.
msg89583 - (view)	Author: Fredrik Lundh (effbot) *	Date: 2009-06-21 21:33
Umm. Isn't _encode used to encode tags and attribute names? The charref syntax is only valid in CDATA sections and attribute values, which are encoded by the corresponding _escape functions. I suspect this patch will make things blow up on a non-ASCII tag/attribute name.
msg89585 - (view)	Author: Fredrik Lundh (effbot) *	Date: 2009-06-21 21:42
Did you look at the 1.3 alpha code base when you came up with this idea? Unfortunately, 1.3's _encode is used for a different purpose... I don't have time to test it tonight, but I suspect that 1.3's escape_data/escape_attrib functions might work better under 3.X; they do the text.replace dance first, and then an explicit text.encode(encoding, "xmlcharrefreplace") at the end. E.g. def _escape_cdata(text, encoding): # escape character data try: # it's worth avoiding do-nothing calls for strings that are # shorter than 500 character, or so. assume that's, by far, # the most common case in most applications. if "&" in text: text = text.replace("&", "&") if "<" in text: text = text.replace("<", "<") if ">" in text: text = text.replace(">", ">") return text.encode(encoding, "xmlcharrefreplace") except (TypeError, AttributeError): _raise_serialization_error(text)
msg89623 - (view)	Author: Jerry Chen (jcsalterego)	Date: 2009-06-23 05:37
The attached patch includes Neil's original additions to test_xml_etree.py. I also noticed that _encode_entity wasn't being called in ElementTree in py3k, with the important bit being the nested function escape_entities(), in conjunction with _escape and _escape_map. In 2.x, _encode_entity() is used after _encode() throws Unicode exceptions [1], so I figured it would make sense to take the core functionality of _escape_entities() and integrate it into _encode in the same fashion -- when an exception is thrown. Basically, I: - changed _escape regexp from using "[\x0080-\uffff]" to "[\x80-xff]" - extracted _encode_entity.escape_entities() and made it _escape_entities of module scope - removed _encode_entity() - added UnicodeEncodeError exception in _encode() I'm not sure what the expected outcome is supposed to be when the text is not type bytes but str. With this patch, the output has b"tÃ£t" rather than b"tãt". Hope this is a step in the right direction. [1] ElementTree.py:814, ElementTree.py:829, python 2.7 HEAD r50941
msg89684 - (view)	Author: Fredrik Lundh (effbot) *	Date: 2009-06-24 21:51
That's backwards, unless I'm missing something here: charrefs represent Unicode characters, not UTF-8 byte values. The character "LATIN SMALL LETTER A WITH TILDE" with the character value 227 should be represented as "ã" if serialized to an encoding that doesn't support non-ASCII characters. And there's no need to use RE:s to filter things under 3.X; those parts of ET 1.2 are there for pre-2.0 compatibility. Did you try running the tests with the escape function I posted?
msg89690 - (view)	Author: Jerry Chen (jcsalterego)	Date: 2009-06-24 23:56
Thanks for the explanation -- looks like I was way off base on that one. I took a look at the code you provided but it doesn't work as a drop-in replacement for _escape_cdata, since that function returns a string rather than bytes. However taking your code, calling it _encode_cdata and then refactoring all calls _encode(_escape_cdata(x), encoding) to _encode_cdata(x, encoding) seems to do the trick and passes the tests. Specific example: - file.write(_encode(_escape_cdata(node.text), encoding)) + file.write(_encode_cdata(node.text, encoding)) One minor modification is to return the string as is if encoding=None, just like _encode: def _encode_cdata(text, encoding): # escape character data try: text = text.replace("&", "&") text = text.replace("<", "<") text = text.replace(">", ">") if encoding: return text.encode(encoding, "xmlcharrefreplace") else: return text except (TypeError, AttributeError): _raise_serialization_error(text)
msg89715 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2009-06-25 21:26
effbot, do you have an opinion about the latest patch? It'd be nice to not have to delay the release for this.
msg89718 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-06-26 01:12
I disagree with this report being classified as release-critical - it is not a regression over 3.0 (i.e. 3.0 already behaved in the same way). That it is a regression relative to 2.x should not make it release-critical - we can still fix such regressions in 3.2. In addition, there is an easy work-around for applications that run into the problem - just use utf-8 as the output encoding always: py> e = ET.XML(b"<?xml version='1.0' encoding='iso-8859-1'?><body>t\xe3t</body>") py> ET.tostring(e,encoding='utf-8') b'<body>t\xc3\xa3t</body>'
msg89722 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-06-26 03:30
+1 for Py3.1.1
msg89728 - (view)	Author: Jerry Chen (jcsalterego)	Date: 2009-06-26 14:21
Either way, it would be nice to get feedback so we can iterate on the patch or close out this issue already :-)
msg95045 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-11-08 20:43
The patch looks ok to me.
msg99125 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-02-09 16:56
Committed in r78123 (py3k) and r78124 (3.1). I've also removed _escape_cdata() since it wasn't used anymore. Thanks Jerry for the patch.

History
Date	User	Action	Args
2022-04-11 14:56:49	admin	set	github: 50482
2010-02-09 16:56:25	pitrou	set	status: open -> closed resolution: fixed messages: + msg99125 stage: resolved
2009-11-08 20:43:14	pitrou	set	messages: + msg95045
2009-06-26 14:21:19	jcsalterego	set	messages: + msg89728
2009-06-26 11:13:39	pitrou	set	priority: release blocker -> critical versions: + Python 3.2, - Python 3.0
2009-06-26 03:30:55	rhettinger	set	nosy: + rhettinger messages: + msg89722
2009-06-26 01:12:28	loewis	set	nosy: + loewis messages: + msg89718
2009-06-25 21:26:50	benjamin.peterson	set	nosy: + benjamin.peterson messages: + msg89715
2009-06-24 23:56:04	jcsalterego	set	files: + issue6233-encode_cdata.diff messages: + msg89690
2009-06-24 21:51:25	effbot	set	messages: + msg89684
2009-06-23 05:37:25	jcsalterego	set	files: + issue6233-escape_entities.diff nosy: + jcsalterego messages: + msg89623
2009-06-21 21:42:03	effbot	set	messages: + msg89585
2009-06-21 21:33:01	effbot	set	messages: + msg89583
2009-06-18 13:25:14	pitrou	set	priority: release blocker nosy: + pitrou messages: + msg89505
2009-06-18 13:12:25	Neil Muller	set	files: + issue6233_py3k_with_test.diff messages: + msg89504
2009-06-12 12:50:24	Neil Muller	set	messages: + msg89276
2009-06-07 21:47:42	Neil Muller	set	files: + issue6233_py3k.diff keywords: + patch messages: + msg89059
2009-06-07 21:30:58	Neil Muller	create