Message 191496 - Python tracker

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

Author	Brendan.OConnor
Recipients	Brendan.OConnor
Date	2013-06-19.21:21:54
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1371676914.3.0.530507394222.issue18268@psf.upfronthosting.co.za>
In-reply-to

Content
(This is Python 2.7 so I'm using string vs unicode terminology.)

When I use ElementTree.fromstring() and read the .text attribute on nodes, the value is usually a string object, but in rare cases it's a unicode object. I'm parsing many XML documents of newspaper text [1]; on one subset of the data, out of 5 million nodes, ~200 have a unicode object as their .text value.
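
A minimal sketch of how the mix of types could be tallied for one document. The helper name text_type_counts is hypothetical; it is not part of ElementTree or of the processing code [2]. On Python 3 every .text value is str, so the str/unicode split described above would only be expected to show up under Python 2.7.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def text_type_counts(xml_string):
    """Count how often each Python type appears as a node's .text value."""
    root = ET.fromstring(xml_string)
    # Nodes without text have .text set to None; skip them.
    return Counter(type(node.text).__name__
                   for node in root.iter()
                   if node.text is not None)
```

Running this over each input document and summing the counters would give totals comparable to the "~200 out of 5 million" figure.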

I think this is all related to http://bugs.python.org/issue11033 but I can't figure out how, exactly.  I'm passing in strings to ElementTree.fromstring() like you're supposed to.

The workaround is to defensively convert the .text value to unicode [3].

[1] data is http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2012T21

[2] my processing code is https://github.com/brendano/gigaword_conversion/blob/master/annogw2justsent.py

[3]

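def convert_to_unicode(mystr):
    # Python 2: pass unicode through, decode byte strings as UTF-8.
    if isinstance(mystr, unicode):
        return mystr
    if isinstance(mystr, str):
        return mystr.decode('utf8')
    # Anything else (e.g. None for a node with no text) is returned unchanged.
    return mystr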
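
As a sketch of how the workaround might be applied while walking a tree, the variant below also runs on Python 3. The names text_type, to_text and iter_node_text are illustrative assumptions, not taken from the processing code [2], and UTF-8 is assumed as the encoding of any byte strings.

```python
import xml.etree.ElementTree as ET

# unicode on Python 2, str on Python 3.
text_type = type(u"")

def to_text(value, encoding="utf8"):
    """Return value as a text object, decoding byte strings if needed."""
    if value is None or isinstance(value, text_type):
        return value
    return value.decode(encoding)

def iter_node_text(xml_string):
    """Yield (tag, text) pairs with every text value normalized."""
    root = ET.fromstring(xml_string)
    for node in root.iter():
        yield node.tag, to_text(node.text)
```

With this, downstream code sees only text objects (or None for nodes that have no text), regardless of which type the parser produced.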

History
Date	User	Action	Args
2013-06-19 21:21:54	Brendan.OConnor	set	recipients: + Brendan.OConnor
2013-06-19 21:21:54	Brendan.OConnor	set	messageid: <1371676914.3.0.530507394222.issue18268@psf.upfronthosting.co.za>
2013-06-19 21:21:54	Brendan.OConnor	link	issue18268 messages
2013-06-19 21:21:54	Brendan.OConnor	create