classification
Title: ElementTree.fromstring non-deterministically gives unicode text data
Type: behavior Stage: resolved
Components: XML Versions: Python 2.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: Brendan.OConnor, eli.bendersky, r.david.murray, scoder
Priority: normal Keywords:

Created on 2013-06-19 21:21 by Brendan.OConnor, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (5)
msg191496 - Author: Brendan O'Connor (Brendan.OConnor) Date: 2013-06-19 21:21
(This is Python 2.7 so I'm using string vs unicode terminology.)

When I use ElementTree.fromstring(), and use the .text field on nodes, the value is usually a string object, but in rare cases it's a unicode object.  I'm parsing many XML documents of newspaper text [1]; on one subset of the data, out of 5 million nodes, ~200 of them have a unicode object for the .text field.

I think this is all related to http://bugs.python.org/issue11033 but I can't figure out how, exactly.  I'm passing in strings to ElementTree.fromstring() like you're supposed to.

The workaround is to defensively convert the .text value to unicode [3].

[1] data is http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2012T21

[2] my processing code is https://github.com/brendano/gigaword_conversion/blob/master/annogw2justsent.py

[3]

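def convert_to_unicode(mystr):
    # Python 2 only: .text may come back as either str or unicode.
    if isinstance(mystr, unicode):
        return mystr
    if isinstance(mystr, str):
        # Assumes the byte string is UTF-8 encoded.
        return mystr.decode('utf8')
    # Anything else (such as None for an element without text) yields None.
    return None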
msg191500 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-06-20 04:38
This kind of thing is why Python 3 exists.  Presumably some bit of the ElementTree code is successfully converting non-ASCII into unicode, and then when that is mixed with the result it is returning, you end up with unicode.  But that is just a guess; you'll have to dig into a specific example to figure out why it is happening.

Or when you say non-deterministically, do you mean that you have tried to reproduce it with a specific entry that returns unicode in the full run and it does not reproduce?  Although even in that case it might be due to some complex interaction in the non-reduced code...

It might be interesting to run it under Python 3 and see if anything odd happens there, but it will probably just work.
msg191501 - Author: Brendan O'Connor (Brendan.OConnor) Date: 2013-06-20 04:49
By "non-deterministic" I just mean that the conversion happens for some data but not other data.  I should try to find examples that cause it to happen.
msg194318 - Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2013-08-04 01:00
I'm not sure what the issue here is, exactly. Python 2.7 is known for implicit conversions between ascii and unicode, and this appears to be an artifact of your data. Note that Python 2.7 only gets fixes for serious bugs at this point.

Can you reproduce this problem with Python 3.3? More generally, can you provide a small reproducer? Without this I don't think this is a constructive report, and will close the issue in a few days.
msg194900 - Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-08-11 16:12
Rejecting this ticket was the right thing to do. It's not a bug but a feature. In Python 2.x, ElementTree returns any text content that can correctly be represented as an ASCII encoded string in the native Py2.x string type (i.e. 'str'). Only non-ASCII strings are returned as unicode values. So it's actually completely deterministic and predictable behaviour. Amongst other things, it saves memory.
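
A minimal sketch of how one might check this, not taken from the thread: the helper name text_type_name and both sample documents are hypothetical. Under Python 2.7 the description above implies 'str' for the first document and 'unicode' for the second; under Python 3 both come back as 'str'.

```python
import xml.etree.ElementTree as ET


def text_type_name(xml_bytes):
    """Parse a UTF-8 encoded XML document and name the type of the root's .text."""
    root = ET.fromstring(xml_bytes)
    return type(root.text).__name__


# Hypothetical sample documents: one pure ASCII, one with a non-ASCII character.
ascii_doc = b"<p>plain text</p>"
non_ascii_doc = u"<p>caf\u00e9</p>".encode("utf-8")

print(text_type_name(ascii_doc))
print(text_type_name(non_ascii_doc))
```

The type depends only on the characters in each text node, which is why the same input always produces the same result.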

Note that in Python 2.x, ASCII-only str values are compatible with unicode values and get promoted to unicode at need. If you want to make sure you always use unicode values, you can call "unicode(text)" on whatever you get back, but in practice, it's really not a problem.
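
One way to apply that normalization to a whole tree, as a sketch under assumptions: TEXT_TYPE, to_text, and all_text are hypothetical names, and the decode("ascii") call relies on the statement above that any str ElementTree returns on Python 2 is ASCII-only.

```python
import xml.etree.ElementTree as ET

# unicode on Python 2, str on Python 3
TEXT_TYPE = type(u"")


def to_text(value):
    """Return value as a text (unicode) object; None passes through unchanged."""
    if value is None or isinstance(value, TEXT_TYPE):
        return value
    # In practice only reached on Python 2, where ElementTree is assumed
    # to hand back ASCII-only str objects.
    return value.decode("ascii")


def all_text(xml_bytes):
    """Collect the normalized .text of every element that has text."""
    root = ET.fromstring(xml_bytes)
    return [to_text(el.text) for el in root.iter() if el.text is not None]


# Hypothetical sample document mixing ASCII and non-ASCII text nodes.
doc = u"<doc><s>plain</s><s>caf\u00e9</s></doc>".encode("utf-8")
texts = all_text(doc)
```

Every item in texts is then of the same text type regardless of which interpreter version ran the parse.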
History
Date                 User             Action  Args
2022-04-11 14:57:47  admin            set     github: 62468
2013-08-11 16:12:16  scoder           set     nosy: + scoder; messages: + msg194900
2013-08-10 12:55:54  eli.bendersky    set     status: open -> closed; resolution: not a bug; stage: resolved
2013-08-04 01:00:05  eli.bendersky    set     messages: + msg194318
2013-06-20 04:49:51  Brendan.OConnor  set     messages: + msg191501
2013-06-20 04:39:40  r.david.murray   set     nosy: + eli.bendersky
2013-06-20 04:38:36  r.david.murray   set     nosy: + r.david.murray; messages: + msg191500; components: + XML
2013-06-19 21:21:54  Brendan.OConnor  create