classification
Title: Badly formed XML using etree and utf-16
Type: behavior
Stage: needs patch
Components: XML
Versions: Python 3.2, Python 2.7
process
Status: open
Resolution:
Dependencies: Update ElementTree with upstream changes (#6472)
Superseder:
Assigned To: effbot
Nosy List: BreamoreBoy, Richard.Urwin, amaury.forgeotdarc, bugok, effbot, flox, nnorwitz, rurwin
Priority: normal
Keywords:

Created on 2007-08-05 15:01 by bugok, last changed 2010-10-02 09:56 by amaury.forgeotdarc.

Files
File name Uploaded Description
patch.txt rurwin, 2008-11-14 17:32 patch to xml/etree/ElementTree.py
bug-test.py Richard.Urwin, 2010-07-26 13:24 demonstrator
Messages (13)
msg32587 - (view) Author: BugoK (bugok) Date: 2007-08-05 15:01
Hello,

The bug occurs when writing an XML file using the UTF-16 encoding.
The problem is that etree encodes every string to UTF-16 on its own, which inserts the BOM (bytes 0xFF 0xFE) before every string (tag, text, attribute name, etc.) and so produces badly formed UTF-16 output.

A possible solution, offered by a co-worker of mine, is to use a UTF-16 writer (from codecs.getwriter('utf-16')) to write the file.
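
The difference between the two approaches can be sketched as follows. This is a minimal illustration in Python 3 syntax, not etree's actual code; the `pieces` list is a made-up stand-in for the strings the serializer writes.

```python
import codecs
import io

pieces = ["<", "html", ">"]  # illustrative fragments, not etree internals

# Encoding each piece on its own prepends a BOM every time.
per_piece = b"".join(p.encode("utf-16") for p in pieces)

# A stream writer remembers that the BOM was written and emits it only once.
buf = io.BytesIO()
writer = codecs.getwriter("utf-16")(buf)
for p in pieces:
    writer.write(p)
streamed = buf.getvalue()

bom = codecs.BOM_UTF16  # BOM in the platform's native byte order
print("BOMs when encoding per piece:", per_piece.count(bom))
print("BOMs with a stream writer:", streamed.count(bom))
```

Only the streamed bytes decode back to the original text with a single leading BOM.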

Best,

BugoK.
msg32588 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2007-08-07 05:54
Fredrik, could you take a look at this?
msg32589 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2007-08-07 06:20
ET's standard serializer currently only supports ASCII-compatible encodings.  See e.g.

    http://effbot.python-hosting.com/ticket/47

The best workaround for ET 1.2 (Python 2.5) is probably to serialize as "utf-8" and transcode:

    out = unicode(ET.tostring(elem), "utf-8").encode(...)
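
A rough Python 3 counterpart of this idea, offered as a sketch rather than as the author's code. It assumes Python 3.2 or later, where `tostring` accepts `encoding="unicode"`; the sample element and the hand-written declaration are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Illustrative input; any Element would do.
elem = ET.fromstring("<html><body>caf\u00e9</body></html>")

# encoding="unicode" returns a str with no XML declaration and no BOM.
text = ET.tostring(elem, encoding="unicode")

# Add the declaration by hand, then encode the whole document in one call
# so that the BOM appears exactly once, at the start.
declaration = "<?xml version='1.0' encoding='UTF-16'?>\n"
out = (declaration + text).encode("utf-16")
```

Because the document is encoded in a single call, no per-string BOMs can end up inside the output.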
msg75864 - (view) Author: Richard Urwin (rurwin) Date: 2008-11-14 15:33
This is a bug in two halves.

1. Not all characters in the file are UTF-16. The initial xml header
isn't, and the individual < > etc characters are not. This is just a
matter of extending the methodology to encode all characters and not
just the textual bits. There is no work-around except a five-minute hack
of the ElementTree.write() method.

2. Every write has a BOM, so corrupting the file in a manner analogous
to bug 555360. This is a result of using string.encode() and is a
well-known feature. It can be worked around by using UTF-16LE or
UTF-16BE which do not prepend a BOM, but then the file doesn't have any
BOM. A complete solution would be to rewrite ElementTree.write() to use
a different encoding methodology such as StreamWriter.
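
The UTF-16LE work-around in point 2, combined with writing the BOM by hand, might look like the sketch below. It is written in Python 3 syntax and does not reproduce the attached patch; the `pieces` list is a hypothetical stand-in for the strings ElementTree.write() emits.

```python
import codecs

# Hypothetical stand-in for the strings the serializer writes one by one.
pieces = ["<?xml version='1.0' encoding='UTF-16'?>\n", "<", "html", " />"]

# Write the BOM once, then encode every piece with the BOM-less
# "utf-16-le" codec so no further BOMs are inserted.
out = codecs.BOM_UTF16_LE + b"".join(p.encode("utf-16-le") for p in pieces)
```

A reader that decodes `out` as "utf-16" uses the leading BOM to pick the byte order and gets the original text back.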

I have made the above hack and work-around for my own use, and I can
report that it produces perfect UTF-16.
msg75866 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-11-14 15:48
Would you provide a patch?
msg75875 - (view) Author: Richard Urwin (rurwin) Date: 2008-11-14 17:32
Here is a patch of my quick hack, more for interest than as a suggestion
that it be used, although it does produce good output so long as you
avoid the BOM.

The full solution is beyond my (very weak) Python skills. The character
encoding is tied in with XML character substitution (&amp; etc. and
hexadecimal representation of multibyte characters). I could disentangle
it, but I probably wouldn't produce optimal Python, or indeed anything
that wouldn't inspire mirth and/or incredulity.

NB. The workaround suggested by Fredrik Lundh doesn't solve our
particular problems, since the downsize to UTF-8 causes the multi-byte
characters to be represented in hex. Our software doesn't read those. (I
know that's our problem.)
msg99394 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-02-16 11:43
Could you provide a test case, so we can check if the upgrade proposed on #6472 solves this issue?
msg111533 - (view) Author: Mark Lawrence (BreamoreBoy) Date: 2010-07-25 10:23
@Richard: Could you provide a test case for this, or do you consider it beyond your Python capabilities, given your comments in msg75875?
msg111608 - (view) Author: Richard Urwin (Richard.Urwin) Date: 2010-07-26 13:24
I can't produce an automated test, for want of time, but here is a demonstrator.

Grab the example XHTML from http://docs.python.org/library/xml.etree.elementtree.html#elementtree-objects or use some tiny ASCII-encoded xml file. Save it as "file.xml" in the same folder as bug-test.py attached here.

Execute bug-test.xml

file.xml is read and then written in UTF-16. The output file is then read and dumped to stdout as a byte-stream.

1. To be correct UTF-16, the output should start with 255 254, which should never occur in the rest of the file.

2. The rest of the output (including the first line) should alternate zeros with ASCII character codes.

3. The file output.xml should be loadable in a UTF-16-capable text editor (e.g. jEdit), be recognised as UTF-16 and be identical in terms of content to file.xml.
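
Criteria 1 and 2 can be checked mechanically. The sketch below is not the attached bug-test.py; `is_ascii_utf16_le` is a hypothetical helper that assumes Python 3, little-endian output and an ASCII-only document.

```python
import codecs


def is_ascii_utf16_le(data):
    """Check criteria 1 and 2 for an ASCII-only, little-endian document."""
    bom = codecs.BOM_UTF16_LE  # the bytes 255 254
    # Criterion 1: the BOM opens the file and occurs nowhere else.
    if not data.startswith(bom) or data.count(bom) != 1:
        return False
    body = data[len(bom):]
    if len(body) % 2:
        return False
    # Criterion 2: every code unit is an ASCII byte followed by a zero byte.
    return all(lo < 128 and hi == 0 for lo, hi in zip(body[0::2], body[1::2]))


good = codecs.BOM_UTF16_LE + "<html>\n".encode("utf-16-le")
print(is_ascii_utf16_le(good))
```

Criterion 3 still needs a manual look in an editor, or a parse of the output file.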
msg111611 - (view) Author: Richard Urwin (Richard.Urwin) Date: 2010-07-26 13:27
> Execute bug-test.xml

I meant bug-test.py, of course
msg111631 - (view) Author: Mark Lawrence (BreamoreBoy) Date: 2010-07-26 15:09
@Florent: is this something you could pick up? I think it's out of my league.
msg111635 - (view) Author: Richard Urwin (Richard.Urwin) Date: 2010-07-26 15:31
As an example, here are the first two lines of output when I use Python 2.6.3:
60 63 120 109 108 32 118 101 114 115 105 111 110 61 39 49 46 48 39 32 101 110 99 111 100 105 110 103 61 39 85 84 70 45 49 54 39 63 62 10
60 255 254 104 0 116 0 109 0 108 0 62 255 254 10

Note:
No 255 254 at the start of the file, but several within it.
No zeros interspersing the first line and the odd one missing thereafter.
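
To make the dump easier to read, the two lines can be turned back into bytes and inspected. This is a small Python 3 sketch; the byte values are copied from the dump above and the variable names are illustrative.

```python
first_line = bytes([
    60, 63, 120, 109, 108, 32, 118, 101, 114, 115, 105, 111, 110, 61, 39,
    49, 46, 48, 39, 32, 101, 110, 99, 111, 100, 105, 110, 103, 61, 39,
    85, 84, 70, 45, 49, 54, 39, 63, 62, 10,
])
second_line = bytes([60, 255, 254, 104, 0, 116, 0, 109, 0, 108, 0, 62, 255, 254, 10])

# The declaration is plain single-byte ASCII, not UTF-16.
print(repr(first_line.decode("ascii")))

# The second line mixes single-byte "<" and ">" with BOM-prefixed UTF-16 runs.
print(second_line.count(b"\xff\xfe"))          # number of embedded BOMs
print(second_line[3:11].decode("utf-16-le"))   # the tag name between them
```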
msg117864 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-10-02 09:56
Python 3.1 improves the situation: the file looks more like UTF-16, except that the BOM ("\xff\xfe") is repeated throughout, probably on every internal call to file.write().

Here is a test script that should work on both 2.7 and 3.1.

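import codecs
from io import BytesIO
from xml.etree.ElementTree import ElementTree

content = "<?xml version='1.0' encoding='UTF-16'?><html></html>"
source = BytesIO(content.encode('utf-16'))
tree = ElementTree()
tree.parse(source)

# Write content
output = BytesIO()
tree.write(output, encoding="utf-16")
data = output.getvalue()

# The serializer emits its own declaration and may collapse <html></html>,
# so check the BOM rather than comparing against the input string:
# it must open the output and appear nowhere else.
assert data.startswith(codecs.BOM_UTF16)
assert data.count(codecs.BOM_UTF16) == 1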
History
Date User Action Args
2010-10-02 09:56:28 amaury.forgeotdarc set messages: + msg117864; stage: test needed -> needs patch
2010-07-26 15:31:50 Richard.Urwin set messages: + msg111635
2010-07-26 15:09:21 BreamoreBoy set messages: + msg111631
2010-07-26 13:27:27 Richard.Urwin set messages: + msg111611
2010-07-26 13:24:27 Richard.Urwin set files: + bug-test.py; nosy: + Richard.Urwin; messages: + msg111608
2010-07-25 10:23:24 BreamoreBoy set nosy: + BreamoreBoy; messages: + msg111533
2010-02-16 11:43:34 flox set nosy: + flox; versions: + Python 2.7, Python 3.2, - Python 2.6; messages: + msg99394; dependencies: + Update ElementTree with upstream changes; type: behavior; stage: test needed
2008-11-14 17:32:16 rurwin set files: + patch.txt; messages: + msg75875
2008-11-14 15:48:20 amaury.forgeotdarc set nosy: + amaury.forgeotdarc; messages: + msg75866
2008-11-14 15:33:53 rurwin set nosy: + rurwin; messages: + msg75864; versions: + Python 2.6
2007-08-05 15:01:57 bugok create