Issue 9692: UnicodeDecodeError in ElementTree.tostring()

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/53901

classification

Title:	UnicodeDecodeError in ElementTree.tostring()
Type:	behavior	Stage:	needs patch
Components:	XML	Versions:	Python 2.7

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	amaury.forgeotdarc, eli.bendersky, flox, uis
Priority:	normal	Keywords:

Created on 2010-08-26 14:42 by uis, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (7)
msg114980 - (view)	Author: Ulrich Seidl (uis)	Date: 2010-08-26 14:42
The following code leads to a UnicodeError in Python 2.7, while it works fine in 2.6 and 2.5:

    # -*- coding: latin-1 -*-
    import xml.etree.cElementTree as ElementTree
    oDoc = ElementTree.fromstring( '<?xml version="1.0" encoding="iso-8859-1"?><ROOT/>' )
    oDoc.set( "ATTR", "ÄÖÜ" )
    print ElementTree.tostring( oDoc , encoding="iso-8859-1" )
msg114984 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-08-26 15:17
IMO the code is not correct: how does ElementTree know which encoding is used for the attribute value? Even 2.5 prints different content when the script is saved with a different encoding. The line should read oDoc.set( "ATTR", u"ÄÖÜ" ), or use ASCII-only characters.
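To illustrate the ambiguity described above, here is a minimal sketch in Python 3 syntax (not part of the original thread, which targets Python 2). It shows that the same three bytes decode to different text depending on the assumed encoding, whereas a text (unicode) attribute value serializes without any guessing. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

raw = b"\xc4\xd6\xdc"  # the bytes of "ÄÖÜ" when the source file is saved as latin-1

# The same bytes mean different things under different encodings.
as_latin1 = raw.decode("iso-8859-1")  # the intended German umlauts
as_cp1251 = raw.decode("cp1251")      # three Cyrillic letters instead

# With a text (unicode) attribute value, nothing has to be guessed.
root = ET.fromstring("<ROOT/>")
root.set("ATTR", as_latin1)
out = ET.tostring(root, encoding="iso-8859-1")  # bytes in the requested encoding
```

The serializer only needs the target encoding here, because the attribute value is already unambiguous text.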
msg115002 - (view)	Author: Ulrich Seidl (uis)	Date: 2010-08-26 16:21
Of course, if you use a unicode string it works, and of course it would be easy to switch to unicode for this demo code. Unfortunately, the affected application is a little more complex, and it is not that easy to switch to unicode. I just wonder why the tostring() method does not assume that internal strings are encoded in the explicitly provided encoding. Is ElementTree restricted to the use of unicode strings? Anyway, why was it working (as expected) with Python 2.5 and Python 2.6?
msg115003 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-08-26 16:26
Testing with Python 2.5: oDoc.set("ATTR", "ÄÖÜ") uses the encoding of the source file (declared with "# -*- coding:"). If I use utf-8 instead, the output is: <ROOT ATTR="&#195;&#132;&#195;&#150;&#195;&#156;" /> which contains the numbers of the 3 byte pairs (the two-byte UTF-8 encoding of each character).
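The numbers mentioned in this message correspond to the byte values of the UTF-8 encoding, two bytes per character. A minimal Python 3 sketch (not from the original thread) that reproduces them by treating each byte as if it were a character of its own:

```python
text = "\u00c4\u00d6\u00dc"  # "ÄÖÜ"

# Each of the three characters takes two bytes in UTF-8.
utf8_bytes = text.encode("utf-8")
numbers = list(utf8_bytes)

# Escaping byte by byte yields one numeric character reference per byte,
# which is why six references appear for three characters.
char_refs = "".join("&#%d;" % b for b in utf8_bytes)
```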
msg115012 - (view)	Author: Ulrich Seidl (uis)	Date: 2010-08-26 17:59
Well, the output of the print is not that interesting, as long as ElementTree is able to restore the former attribute's value when reading it in again. The print was just used to illustrate that a UnicodeDecodeError appears. Think about doing an ElementTree.fromstring( ... ).get( "ATTR" ).encode( "iso-8859-1" ).
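The round trip described here can be sketched as follows. This is Python 3 syntax with a text attribute value, so it illustrates the expected end result rather than the failing Python 2.7 byte-string case; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

value = "\u00c4\u00d6\u00dc"  # "ÄÖÜ"

root = ET.Element("ROOT")
root.set("ATTR", value)

# Serialize to iso-8859-1 bytes, then parse the result again.
serialized = ET.tostring(root, encoding="iso-8859-1")
restored = ET.fromstring(serialized).get("ATTR")

# Encoding the restored value gives back the original latin-1 bytes.
restored_bytes = restored.encode("iso-8859-1")
```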
msg126005 - (view)	Author: Ulrich Seidl (uis)	Date: 2011-01-11 13:33
I would suggest adding an additional except branch to (at least) the following functions of ElementTree.py:

* _encode,
* _escape_attrib, and
* _escape_cdata

The except branch could look like:

    except (UnicodeDecodeError):
        return text.decode( encoding ).encode( encoding, "xmlcharrefreplace")
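The idea behind the suggested branch can be sketched as a standalone helper in Python 3 syntax. The function name `encode_text` is hypothetical and this is not ElementTree's actual code; it only assumes, as the suggestion does, that byte strings are already in the target encoding.

```python
def encode_text(text, encoding):
    """Encode text for XML output (hypothetical helper, not ElementTree API)."""
    if isinstance(text, bytes):
        # Assumption taken from the suggestion: byte strings are
        # already encoded in the target encoding.
        text = text.decode(encoding)
    # Characters the target encoding cannot represent become
    # numeric character references such as &#8364;.
    return text.encode(encoding, "xmlcharrefreplace")


from_bytes = encode_text(b"\xc4\xd6\xdc", "iso-8859-1")
from_text = encode_text("\u00c4\u00d6\u00dc\u20ac", "iso-8859-1")  # "ÄÖÜ€"
```

The assumption is the weak point: if the bytes are in some other encoding, the helper silently produces wrong text instead of raising an error.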
msg166023 - (view)	Author: Florent Xicluna (flox) *	Date: 2012-07-21 13:33
I propose to close this as won't fix. The upgrade to ElementTree 1.3 brought some consistency when dealing with Unicode and encodings. The reported behavior was only seen in Python 2.7, when using bytes improperly.
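For context, a minimal Python 3 sketch of why non-ASCII byte strings fail in this situation. It assumes the error comes from Python 2 implicitly decoding a byte string with the ASCII codec before calling encode() on it; that is an assumption about the mechanism, not a quote from the ElementTree source. The explicit decode("ascii") below imitates that implicit step.

```python
raw = b"\xc4\xd6\xdc"  # latin-1 bytes for "ÄÖÜ"

try:
    # Python 2 performed the ASCII decode implicitly when encode()
    # was called on a byte string; here it is spelled out.
    raw.decode("ascii").encode("iso-8859-1", "xmlcharrefreplace")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    reason = exc.reason  # 'ordinal not in range(128)'
```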

History
Date	User	Action	Args
2022-04-11 14:57:05	admin	set	github: 53901
2012-07-21 13:33:29	flox	set	status: open -> closed nosy: + eli.bendersky messages: + msg166023 resolution: wont fix
2011-01-11 13:33:53	uis	set	messages: + msg126005
2010-08-26 17:59:21	uis	set	messages: + msg115012
2010-08-26 16:26:50	amaury.forgeotdarc	set	messages: + msg115003
2010-08-26 16:21:04	uis	set	messages: + msg115002
2010-08-26 15:18:00	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg114984
2010-08-26 14:47:14	brian.curtin	set	nosy: + flox type: behavior stage: needs patch
2010-08-26 14:42:54	uis	create