classification
Title: UnicodeDecodeError in ElementTree.tostring()
Type: behavior
Stage: needs patch
Components: XML
Versions: Python 2.7
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: amaury.forgeotdarc, eli.bendersky, flox, uis
Priority: normal
Keywords:

Created on 2010-08-26 14:42 by uis, last changed 2012-07-21 13:33 by flox. This issue is now closed.

Messages (7)
msg114980 - Author: Ulrich Seidl (uis) Date: 2010-08-26 14:42
The following code raises a UnicodeDecodeError in Python 2.7, while it works fine in 2.6 and 2.5:

# -*- coding: latin-1 -*-
import xml.etree.cElementTree as ElementTree

oDoc = ElementTree.fromstring(
    '<?xml version="1.0" encoding="iso-8859-1"?><ROOT/>' )
oDoc.set( "ATTR", "ÄÖÜ" )
print ElementTree.tostring( oDoc , encoding="iso-8859-1" )
msg114984 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-08-26 15:17
IMO the code is not correct: how does ElementTree know which encoding is used for the attribute value? Even 2.5 prints different content when the script is saved with a different encoding.

The line should look like:
    oDoc.set( "ATTR", u"ÄÖÜ" )
or use ascii-only characters.
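
A minimal sketch of this suggestion, assuming the pure-Python xml.etree.ElementTree module rather than cElementTree. The attribute value is written with escape sequences so that the example does not depend on the encoding of the source file; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ElementTree

# Parse the document from the report. It is passed as a byte string so
# that the encoding declaration is honoured by the parser.
root = ElementTree.fromstring(
    b'<?xml version="1.0" encoding="iso-8859-1"?><ROOT/>')

# A unicode string: the escapes spell A-umlaut, O-umlaut, U-umlaut.
root.set("ATTR", u"\xc4\xd6\xdc")

# With an explicit encoding, tostring() returns a byte string in which
# the three characters are stored as single ISO-8859-1 bytes.
serialized = ElementTree.tostring(root, encoding="iso-8859-1")
```

Because the value is already a unicode string, the serializer only has to encode it and never needs to guess how a byte string was encoded.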
msg115002 - Author: Ulrich Seidl (uis) Date: 2010-08-26 16:21
Of course it works with a unicode string, and it would be easy to switch to unicode for this demo code. Unfortunately, the affected application is more complex, and switching it to unicode is not that easy. I just wonder why the tostring() method does not assume that internal strings are encoded in the explicitly provided encoding. Is ElementTree restricted to unicode strings? And why did it work (as expected) with Python 2.5 and Python 2.6?
msg115003 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-08-26 16:26
Testing with Python 2.5: oDoc.set("ATTR", "ÄÖÜ") uses the encoding of the source file (as declared by the "# -*- coding:" line). If I use utf-8 instead, the output is:
   <ROOT ATTR="&#195;&#132;&#195;&#150;&#195;&#156;" />
which contains the character references of the six UTF-8 bytes (two per character), not of the three characters themselves.
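
The numbers in that output correspond to the UTF-8 bytes of the three characters, escaped one byte at a time. The short sketch below reproduces them without ElementTree; it only illustrates where the numbers come from and is not the library's actual code path.

```python
# The three characters from the report, written with escapes.
text = u"\xc4\xd6\xdc"

# Encoding to UTF-8 yields two bytes per character, six in total.
utf8_bytes = bytearray(text.encode("utf-8"))

# Emitting one numeric character reference per byte gives the same
# sequence of numbers as the output quoted above.
per_byte_refs = "".join("&#%d;" % b for b in utf8_bytes)
print(per_byte_refs)
```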
msg115012 - Author: Ulrich Seidl (uis) Date: 2010-08-26 17:59
Well, the output of the print is not that interesting as long as ElementTree is able to restore the attribute's former value when reading it in again. The print was only used to illustrate that a UnicodeDecodeError appears. Think about doing
ElementTree.fromstring( ... ).get( "ATTR" ).encode( "iso-8859-1" ).
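
The round trip described here can be sketched as follows, assuming the attribute is set from a unicode string (the form that serializes without error). The names original, restored and restored_bytes are illustrative.

```python
import xml.etree.ElementTree as ElementTree

# The attribute value as a unicode string (A-umlaut, O-umlaut, U-umlaut).
original = u"\xc4\xd6\xdc"

root = ElementTree.Element("ROOT")
root.set("ATTR", original)

# Serialize to ISO-8859-1 bytes; the output carries an XML declaration
# naming the encoding, so the parser can decode it again.
serialized = ElementTree.tostring(root, encoding="iso-8859-1")

# Parse the serialized bytes and read the attribute back.
restored = ElementTree.fromstring(serialized).get("ATTR")

# Re-encoding the restored value gives the original ISO-8859-1 bytes.
restored_bytes = restored.encode("iso-8859-1")
```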
msg126005 - Author: Ulrich Seidl (uis) Date: 2011-01-11 13:33
I would suggest adding an additional except branch to (at least) the following functions of ElementTree.py:
* _encode,
* _escape_attrib, and
* _escape_cdata 

The except branch could look like:

except (UnicodeDecodeError):
    return text.decode( encoding ).encode( encoding, "xmlcharrefreplace")
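
The proposed branch decodes the byte string with the target encoding before re-encoding it. A similar effect can be had without patching the library by decoding byte strings before they reach ElementTree. The sketch below assumes the application knows the encoding of its byte strings; to_text is a hypothetical helper, not part of ElementTree.

```python
import xml.etree.ElementTree as ElementTree


def to_text(value, encoding):
    """Hypothetical helper: return `value` as a unicode string.

    Byte strings are decoded with the caller-supplied encoding;
    values that are already unicode are returned unchanged.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


root = ElementTree.Element("ROOT")

# A legacy byte string holding three ISO-8859-1 encoded characters.
legacy_value = b"\xc4\xd6\xdc"

# Decode at the boundary so ElementTree only ever sees unicode.
root.set("ATTR", to_text(legacy_value, "iso-8859-1"))
serialized = ElementTree.tostring(root, encoding="iso-8859-1")
```

Unlike the except branch, this keeps the encoding assumption in the application, where it is actually known, instead of having the serializer guess it from the output encoding.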
msg166023 - Author: Florent Xicluna (flox) * (Python committer) Date: 2012-07-21 13:33
I propose to close this as won't fix.

The upgrade to ElementTree 1.3 brought some consistency when dealing with Unicode and encodings.

The reported behavior was only seen in Python 2.7, when using bytes improperly.
History
Date                 User               Action  Args
2012-07-21 13:33:29  flox               set     status: open -> closed; nosy: + eli.bendersky; messages: + msg166023; resolution: wont fix
2011-01-11 13:33:53  uis                set     messages: + msg126005
2010-08-26 17:59:21  uis                set     messages: + msg115012
2010-08-26 16:26:50  amaury.forgeotdarc set     messages: + msg115003
2010-08-26 16:21:04  uis                set     messages: + msg115002
2010-08-26 15:18:00  amaury.forgeotdarc set     nosy: + amaury.forgeotdarc; messages: + msg114984
2010-08-26 14:47:14  brian.curtin       set     nosy: + flox; type: behavior; stage: needs patch
2010-08-26 14:42:54  uis                create