classification
Title: cElementTree.iterparse & ElementTree.iterparse return differently encoded strings
Type: behavior Stage: resolved
Components: XML Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder: Update ElementTree with upstream changes
View: 6472
Assigned To: Nosy List: Neil Muller, effbot, flox, jerith, nlopes
Priority: normal Keywords: patch

Created on 2009-06-11 10:53 by Neil Muller, last changed 2010-03-11 15:57 by flox. This issue is now closed.

Messages (9)
msg89248 - (view) Author: Neil Muller (Neil Muller) Date: 2009-06-11 10:53
Consider:

>>> from StringIO import StringIO
>>> source = StringIO('<body xmlns="http://&#233;ffbot.org/ns">text</body>')
>>> import xml.etree.ElementTree as ET
>>> events = ("start-ns",)
>>> context = ET.iterparse(source, events)
>>> for action, elem in context:
...    print action, elem
... 
start-ns ('', u'http://\xe9ffbot.org/ns')
>>> source.seek(0)
>>> import xml.etree.cElementTree as cET
>>> context = cET.iterparse(source, events)
>>> for action, elem in context:
...    print action, elem
... 
start-ns ('', 'http://\xc3\xa9ffbot.org/ns')

I'm not sure which is more correct here, but using different encodings
in the result is somewhat unexpected.
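
The two results above look like the same URI in two representations: ElementTree returns a Unicode string, while cElementTree returns the UTF-8 encoded bytes of that string. A minimal sketch of the relationship, written for Python 3 (so the 8-bit string is spelled as a bytes literal); the variable names are illustrative only:

```python
# Values copied from the two outputs above.
et_uri = u'http://\xe9ffbot.org/ns'        # ElementTree: Unicode string
cet_uri = b'http://\xc3\xa9ffbot.org/ns'   # cElementTree: UTF-8 encoded bytes

# Decoding the C parser's bytes as UTF-8 recovers the Unicode string.
same_text = cet_uri.decode('utf-8') == et_uri
print(same_text)
```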
msg89550 - (view) Author: (nlopes) Date: 2009-06-20 23:39
This is a pretty dumb patch, but it does its job.
Basically it decodes the UTF-8 encoded prefix and URI, then encodes them
into Latin-1. There are probably better ways of doing this, and ideas
are welcome. Patch attached.
msg89551 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2009-06-21 00:15
Converting from UTF-8 to Unicode is the right thing to do, but 
converting back to Latin-1 is not correct -- note that ET returns a 
Unicode string, not an 8-bit string.  There's a "makestring" helper that 
does the right thing in the library; just changing:

parcel = Py_BuildValue("ss", (prefix) ? prefix : "", uri);

to 

parcel = Py_BuildValue("sN", (prefix) ? prefix : "", makestring(uri));

should work (even if you should probably do that in two steps, and look 
for errors from makestring before proceeding).
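
The point about Latin-1 can be sketched in Python. This is only an illustration under the assumption of Python 3 semantics (bytes versus str); the names are hypothetical and it is not the code from the patch:

```python
utf8_uri = b'http://\xc3\xa9ffbot.org/ns'   # bytes as produced by the C parser

text = utf8_uri.decode('utf-8')             # Unicode string, what ET returns
latin1 = text.encode('latin-1')             # an 8-bit string again, not Unicode

# The Latin-1 step also fails outright for characters that Latin-1 cannot
# represent, such as the euro sign.
try:
    u'http://\u20acffbot.org/ns'.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```

Stopping after the decode step gives the same kind of object that ElementTree returns.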
msg89552 - (view) Author: (nlopes) Date: 2009-06-21 00:42
You're right about the conversion to Latin-1.
I actually played with makestring for a bit before going in another
direction (which turned out not to be a good one), because makestring alone
wasn't giving the intended result.

I'll try to work out a good approach for this tomorrow (I already had that
in mind).
msg89560 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2009-06-21 13:12
It should definitely give what's intended (either a Unicode string, or, if 
the content is plain ASCII, an 8-bit string).  What did you get instead?
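
The behaviour described here can be modelled in a few lines of Python. This is a hypothetical model based only on the description above, not the C helper itself; `makestring_model` is an invented name, and the sketch assumes Python 3, where the 8-bit string is a bytes object:

```python
def makestring_model(raw):
    """Hypothetical Python model of the behaviour described above.

    raw is the UTF-8 encoded byte string handed over by the parser.
    """
    try:
        raw.decode('ascii')
    except UnicodeDecodeError:
        # Non-ASCII content: decode the UTF-8 bytes to a Unicode string.
        return raw.decode('utf-8')
    # Plain ASCII content: return the 8-bit string unchanged.
    return raw


ascii_result = makestring_model(b'http://effbot.org/ns')
unicode_result = makestring_model(b'http://\xc3\xa9ffbot.org/ns')
```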
msg89564 - (view) Author: (nlopes) Date: 2009-06-21 17:24
I got pure gibberish output, but I know why. It was a compilation gone
wrong.

To get the same output as ElementTree, I think that instead of

parcel = Py_BuildValue("sN", (prefix) ? prefix : "", makestring(uri));

it should be

parcel = Py_BuildValue("sN", (prefix) ? prefix : "",
PyUnicode_AsUnicode(makestring(uri), strlen(uri)));

Otherwise it will not give the expected result.

Or am I overlooking something?
msg89568 - (view) Author: (nlopes) Date: 2009-06-21 17:41
Don't mind what I just said. I overlooked the N. I couldn't figure out
what was going wrong with your solution.

That works. Mine is a ... ahem.

:)
msg99442 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-02-16 21:58
Merged with the upstream patch proposed on #6472 (with test case).
msg100871 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-11 15:57
Fixed with #6472.
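
For anyone who wants to re-check the reported behaviour, here is a minimal sketch adapted from the original report. It assumes a Python 3 interpreter, where `xml.etree.ElementTree` is the only module to test and the input has to be a bytes stream; the variable names are illustrative:

```python
import io
import xml.etree.ElementTree as ET

xml_bytes = b'<body xmlns="http://&#233;ffbot.org/ns">text</body>'

# Collect the (prefix, uri) pairs reported by "start-ns" events.
namespaces = [
    ns for _event, ns in ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",))
]
print(namespaces)
```

The URI in the reported pair should be a text string containing the single character U+00E9, not its UTF-8 byte sequence.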
History
Date User Action Args
2010-03-11 15:57:40 flox set status: open -> closed; superseder: Update ElementTree with upstream changes; messages: + msg100871; dependencies: - Update ElementTree with upstream changes; resolution: fixed; stage: needs patch -> resolved
2010-02-16 21:58:35 flox set dependencies: + Update ElementTree with upstream changes; messages: + msg99442
2010-02-16 13:23:15 flox set nosy: + flox; priority: normal; components: + XML; type: behavior; stage: needs patch
2009-06-21 17:42:45 nlopes set files: - _elementtree.diff
2009-06-21 17:41:47 nlopes set messages: + msg89568
2009-06-21 17:24:35 nlopes set messages: + msg89564
2009-06-21 13:12:43 effbot set messages: + msg89560
2009-06-21 00:42:26 nlopes set messages: + msg89552
2009-06-21 00:15:37 effbot set messages: + msg89551
2009-06-20 23:39:22 nlopes set files: + _elementtree.diff; nosy: + nlopes; messages: + msg89550; keywords: + patch
2009-06-11 10:53:51 Neil Muller create