Message191496
(This is Python 2.7 so I'm using string vs unicode terminology.)
When I parse a document with ElementTree.fromstring() and read the .text attribute of the resulting nodes, the value is usually a string object, but in rare cases it's a unicode object. I'm parsing many XML documents of newspaper text [1]; on one subset of the data, out of 5 million nodes, ~200 have a unicode object for .text.
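For reference, here is a minimal sketch of how such a count can be made for a single document. The helper name count_text_types is hypothetical (it is not part of ElementTree), and this is not the actual processing code from [2]; it only tallies the type of .text for every node in one parsed tree.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def count_text_types(xml_string):
    """Tally the type name of .text for every node in one document.

    Hypothetical helper for illustration; xml_string is the same kind
    of byte string that is passed to ElementTree.fromstring().
    """
    root = ET.fromstring(xml_string)
    # root.iter() walks the root element and all of its descendants.
    return Counter(type(node.text).__name__ for node in root.iter())


if __name__ == "__main__":
    sample = u"<doc><p>plain</p><p>caf\u00e9</p><p/></doc>".encode("utf-8")
    print(count_text_types(sample))
```

Summing these counters over a whole corpus gives the kind of breakdown described above. Elements with no text at all show up under NoneType.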
I think this is all related to http://bugs.python.org/issue11033 but I can't figure out how, exactly. I'm passing in strings to ElementTree.fromstring() like you're supposed to.
The workaround is to defensively convert the .text value to unicode [3].
[1] data is http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2012T21
[2] my processing code is https://github.com/brendano/gigaword_conversion/blob/master/annogw2justsent.py
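[3]
def convert_to_unicode(mystr):
    # Already a unicode object: return it unchanged.
    if isinstance(mystr, unicode):
        return mystr
    # Byte string: decode it, assuming the data is UTF-8.
    if isinstance(mystr, str):
        return mystr.decode('utf8')
    # Anything else (e.g. None for a node with no text) falls through and returns None.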
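A minimal sketch of applying the same defensive conversion while walking a parsed tree. The names to_text and iter_text are hypothetical, and the input is assumed to be UTF-8. The type check uses bytes, which is an alias for str on Python 2.6+, so the snippet also runs on Python 3. Nodes with no text have .text set to None, which is passed through unchanged.

```python
import xml.etree.ElementTree as ET


def to_text(value, encoding="utf-8"):
    """Return value as a text (unicode) object, or None if it is None.

    Hypothetical helper; the UTF-8 default is an assumption about the data.
    """
    if value is None:
        return None
    # On Python 2, bytes is str, so this branch decodes byte strings.
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already a text object: return it unchanged.
    return value


def iter_text(xml_string):
    """Yield (tag, text) pairs with every text normalized by to_text()."""
    root = ET.fromstring(xml_string)
    for node in root.iter():
        yield node.tag, to_text(node.text)


if __name__ == "__main__":
    sample = u"<doc><p>plain</p><p>caf\u00e9</p><p/></doc>".encode("utf-8")
    for tag, text in iter_text(sample):
        print(repr(tag), repr(text))
```

Normalizing at the point where .text is read means the rest of the code only ever sees one text type, whichever type ElementTree happened to return for a given node.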
Date | User | Action | Args
2013-06-19 21:21:54 | Brendan.OConnor | set | recipients: + Brendan.OConnor
2013-06-19 21:21:54 | Brendan.OConnor | set | messageid: <1371676914.3.0.530507394222.issue18268@psf.upfronthosting.co.za>
2013-06-19 21:21:54 | Brendan.OConnor | link | issue18268 messages
2013-06-19 21:21:54 | Brendan.OConnor | create |