classification
Title: ElementTree.fromstring doesn't work with Unicode
Type: enhancement
Stage: resolved
Components: XML
Versions: Python 2.7
process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To: belopolsky
Nosy List: Brendan.OConnor, Peter.Cai, belopolsky, vstinner
Priority: normal
Keywords:

Created on 2011-01-28 00:07 by Peter.Cai, last changed 2013-08-04 18:22 by belopolsky. This issue is now closed.

Messages (5)
msg127239 - Author: Peter Cai (Peter.Cai) Date: 2011-01-28 00:07
xml.etree.ElementTree.fromstring does not work with Unicode strings. See the code below:

>>> from xml.etree import ElementTree
>>> t = ElementTree.fromstring(u'<doc>诗</doc>')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Python26\lib\xml\etree\ElementTree.py", line 963, in XML
    parser.feed(text)
  File "D:\Python26\lib\xml\etree\ElementTree.py", line 1245, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u8bd7' in position 5: ordinal not in range(128)
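
The error message suggests that the unicode string is being implicitly encoded with the 'ascii' codec somewhere before it reaches the parser. Assuming that is what happens, the same failure can be reproduced without ElementTree at all. This is only an illustrative sketch; the variable names are made up for the example.

```python
# Assumption: the traceback comes from an implicit encode to ASCII.
# Reproduce that encode by hand and inspect where it fails.
text = u'<doc>\u8bd7</doc>'

failed_at = None
failed_char = None
try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    # exc.start is the index of the first character that cannot be encoded
    failed_at = exc.start
    failed_char = exc.object[exc.start]
```

Here failed_at matches the "position 5" reported in the traceback, which is the index of the first non-ASCII character in the document.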
msg127740 - Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2011-02-02 16:08
This works in 3.x:

Python 3.2rc2+ (py3k:88279:88280, Feb  1 2011, 00:01:52)
..
>>> from xml.etree import ElementTree
>>> ElementTree.fromstring('<doc>诗</doc>')
<Element 'doc' at 0x1007daa00>

In 2.x you need to encode unicode strings before passing them to ElementTree.fromstring().  For example:

----
# encoding: utf-8
from xml.etree import ElementTree
t = ElementTree.fromstring(u'<doc>诗</doc>'.encode('utf-8'))
print t.text
----

This is not a bug because fromstring(), unlike some other ElementTree methods, is not documented to support unicode strings. Since 2.x is closed for new features, this has to be rejected.
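
For code that has to handle both encoded and unicode input, the explicit encode can be wrapped in a small helper. The sketch below is only an illustration: parse_xml_text is a hypothetical name, not part of ElementTree, and it assumes the document carries no XML declaration naming an encoding other than UTF-8. It is intended to run on 2.7 and on 3.3 or later.

```python
from xml.etree import ElementTree


def parse_xml_text(text):
    """Hypothetical helper: parse XML given as bytes or as a unicode string.

    Unicode input is encoded to UTF-8 first, since the parser is
    documented to take encoded data.
    """
    if not isinstance(text, bytes):
        # Assumption: the document does not declare a non-UTF-8 encoding.
        text = text.encode('utf-8')
    return ElementTree.fromstring(text)


doc = parse_xml_text(u'<doc>\u8bd7</doc>')
```

On 2.x, bytes is an alias for str, so only unicode objects are encoded; input that is already encoded is passed through unchanged.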
msg127741 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-02-02 16:10
> Since 2.x is closed for new features, this has to be rejected.

We can explain in ElementTree documentation how to pass non-ASCII unicode strings: using explicit encoding to UTF-8.
msg127742 - Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2011-02-02 16:33
On Wed, Feb 2, 2011 at 11:10 AM, STINNER Victor <report@bugs.python.org> wrote:
..
> We can explain in ElementTree documentation how to pass non-ASCII unicode strings: using
> explicit encoding to UTF-8.

ElementTree.fromstring() ultimately calls ElementTree.XMLParser.feed()
which is documented as follows:

"""
feed(data)
Feeds data to the parser. data is encoded data.
"""

Maybe we can simply add "encoded" to the description of the
ElementTree.fromstring() argument as well?
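
As a rough illustration of what "encoded data" means for feed(), the sketch below drives the parser directly with UTF-8 bytes. It assumes the default XMLParser with no arguments, and the variable names are chosen for the example only.

```python
from xml.etree.ElementTree import XMLParser

# Encode the unicode document explicitly before handing it to the parser.
data = u'<doc>\u8bd7</doc>'.encode('utf-8')

parser = XMLParser()
parser.feed(data)      # feed() receives encoded bytes
root = parser.close()  # close() returns the root element
```

After close(), root.text holds the decoded character again, so the encoding step affects only the input side.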
msg194329 - Author: Brendan O'Connor (Brendan.OConnor) Date: 2013-08-04 07:38
Sure, go ahead and close it.  I was just trying to be helpful and report a bug in the Python standard library.  I don't use Python 3.3 so cannot test it.
History
Date                 User             Action  Args
2013-08-04 18:22:59  belopolsky       set     status: open -> closed
2013-08-04 07:38:44  Brendan.OConnor  set     nosy: + Brendan.OConnor
                                              messages: + msg194329
2011-02-02 16:33:08  belopolsky       set     nosy: belopolsky, vstinner, Peter.Cai
                                              messages: + msg127742
2011-02-02 16:10:51  vstinner         set     status: pending -> open
                                              nosy: + vstinner
                                              messages: + msg127741
2011-02-02 16:08:43  belopolsky       set     status: open -> pending
                                              type: crash -> enhancement
                                              assignee: belopolsky
                                              versions: + Python 2.7, - Python 2.6
                                              nosy: + belopolsky
                                              messages: + msg127740
                                              resolution: rejected
                                              stage: resolved
2011-01-28 00:07:35  Peter.Cai        create