classification
Title: UnicodeDecodeError in ElementTree.tostring()
Type: behavior Stage: needs patch
Components: XML Versions: Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, flox, uis
Priority: normal Keywords:

Created on 2010-08-26 14:42 by uis, last changed 2011-01-11 13:33 by uis.

Messages (6)
msg114980 - Author: Ulrich Seidl (uis) Date: 2010-08-26 14:42
The following code raises a UnicodeDecodeError in Python 2.7, while it works fine in 2.6 and 2.5:

# -*- coding: latin-1 -*-
import xml.etree.cElementTree as ElementTree

oDoc = ElementTree.fromstring(
    '<?xml version="1.0" encoding="iso-8859-1"?><ROOT/>' )
oDoc.set( "ATTR", "ÄÖÜ" )
print ElementTree.tostring( oDoc , encoding="iso-8859-1" )
msg114984 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-08-26 15:17
IMO the code is not correct: how does ElementTree know which encoding is used for the attribute value?  Even 2.5 prints a different content when the script is saved with a different encoding.

The line should look like:
    oDoc.set( "ATTR", u"ÄÖÜ" )
or use ascii-only characters.
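
For an application that only holds byte strings, one way to follow this advice is to decode them explicitly before passing them to ElementTree. The following is a minimal sketch, assuming the bytes are known to be ISO-8859-1. The variable names are illustrative, and the pure-Python xml.etree.ElementTree module is used here instead of cElementTree.

```python
import xml.etree.ElementTree as ElementTree

# Byte string as it might arrive from a latin-1 source
# (A-umlaut, O-umlaut, U-umlaut encoded as ISO-8859-1).
raw_value = b"\xc4\xd6\xdc"

oDoc = ElementTree.fromstring("<ROOT/>")

# Decode explicitly, so ElementTree receives text rather than
# bytes whose encoding it would have to guess.
oDoc.set("ATTR", raw_value.decode("iso-8859-1"))

serialized = ElementTree.tostring(oDoc, encoding="iso-8859-1")
```

With the value decoded up front, tostring() no longer has to guess how the attribute was encoded.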
msg115002 - Author: Ulrich Seidl (uis) Date: 2010-08-26 16:21
Of course, it works if you use a unicode string, and it would be easy to switch to unicode for this demo code. Unfortunately, the affected application is more complex, and switching it to unicode is not that easy. I just wonder why the tostring() method does not assume that internal byte strings are encoded in the explicitly provided encoding. Is ElementTree restricted to unicode strings? Anyway, why did it work (as expected) with Python 2.5 and Python 2.6?
msg115003 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-08-26 16:26
Testing with Python 2.5: oDoc.set("ATTR", "ÄÖÜ") uses the encoding of the source file (as declared with "# -*- coding:"). If I use utf-8 instead, the output is:
   <ROOT ATTR="&#195;&#132;&#195;&#150;&#195;&#156;" />
which contains the values of the six UTF-8 bytes (two per character) rather than the three characters themselves.
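
The numbers in that output can be checked by hand. The sketch below is only an illustration (the variable names are not from the original report): it encodes the three characters as UTF-8 and formats each byte as a numeric character reference, which is what happens when the bytes are treated one by one as characters.

```python
# The three characters, written with escapes so the example does not
# depend on the encoding of the source file.
text = u"\xc4\xd6\xdc"

# UTF-8 needs two bytes for each of these characters.
utf8_bytes = text.encode("utf-8")
byte_values = list(bytearray(utf8_bytes))

# Treating every byte as a separate character yields one reference per byte.
char_refs = "".join("&#%d;" % value for value in byte_values)
```

Here char_refs comes out as the six references shown in the output above.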
msg115012 - Author: Ulrich Seidl (uis) Date: 2010-08-26 17:59
Well, the output of the print is not that interesting as long as ElementTree is able to restore the attribute's former value when reading it in again. The print was just used to illustrate that a UnicodeDecodeError occurs. Think about doing an
ElementTree.fromstring( ... ).get( "ATTR" ).encode( "iso-8859-1" ).
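
The round trip described here can be sketched as follows. This is a hedged illustration that uses a unicode attribute value, since that is the case known to work, and the pure-Python xml.etree.ElementTree module. The variable names are assumptions.

```python
import xml.etree.ElementTree as ElementTree

# Attribute value as text, written with escapes to stay independent
# of the source file encoding.
original = u"\xc4\xd6\xdc"

oDoc = ElementTree.fromstring("<ROOT/>")
oDoc.set("ATTR", original)

# Serialize to ISO-8859-1 bytes, then parse the result again.
serialized = ElementTree.tostring(oDoc, encoding="iso-8859-1")
restored = ElementTree.fromstring(serialized).get("ATTR")

# Encoding the restored value gives back the original latin-1 bytes.
restored_bytes = restored.encode("iso-8859-1")
```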
msg126005 - Author: Ulrich Seidl (uis) Date: 2011-01-11 13:33
I would suggest adding an additional except branch to (at least) the following functions of ElementTree.py:
* _encode,
* _escape_attrib, and
* _escape_cdata 

The except branch could look like:

except (UnicodeDecodeError):
    return text.decode( encoding ).encode( encoding, "xmlcharrefreplace")
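
As a standalone illustration of the idea, and not a patch to ElementTree.py, the fallback could be sketched as a helper like the one below. The name encode_with_fallback is hypothetical, and the helper assumes that any byte string it receives is already in the target encoding, which is exactly the assumption questioned earlier in this thread.

```python
def encode_with_fallback(text, encoding):
    """Hypothetical helper, not part of ElementTree."""
    if isinstance(text, bytes):
        # Assumption: byte strings are already in the target encoding.
        text = text.decode(encoding)
    # Characters the encoding cannot represent become numeric references.
    return text.encode(encoding, "xmlcharrefreplace")


# Latin-1 bytes pass through unchanged.
latin1_bytes = encode_with_fallback(b"\xc4\xd6\xdc", "iso-8859-1")

# The euro sign does not exist in ISO-8859-1, so it is replaced
# by a character reference.
euro_ref = encode_with_fallback(u"\u20ac", "iso-8859-1")
```

If the bytes are in fact in some other encoding, such a fallback would silently produce wrong characters instead of raising an error.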
History
Date                 User                Action  Args
2011-01-11 13:33:53  uis                 set     messages: + msg126005
2010-08-26 17:59:21  uis                 set     messages: + msg115012
2010-08-26 16:26:50  amaury.forgeotdarc  set     messages: + msg115003
2010-08-26 16:21:04  uis                 set     messages: + msg115002
2010-08-26 15:18:00  amaury.forgeotdarc  set     nosy: + amaury.forgeotdarc; messages: + msg114984
2010-08-26 14:47:14  brian.curtin        set     nosy: + flox; type: behavior; stage: needs patch
2010-08-26 14:42:54  uis                 create