Message 75864 - Python tracker

Author	rurwin
Recipients	bugok, effbot, nnorwitz, rurwin
Date	2008-11-14.15:33:52
Message-id	<1226676834.44.0.106446657024.issue1767933@psf.upfronthosting.co.za>

Content
This is a bug in two halves.

1. Not all characters in the file are UTF-16. The initial XML header
isn't, and neither are the individual markup characters such as < and >.
Fixing this is just a matter of extending the existing approach so that
it encodes every character and not just the textual bits. There is no
work-around except a five-minute hack of the ElementTree.write() method.

2. Every write emits its own BOM, which corrupts the file in a manner
analogous to bug 555360. This is a result of using string.encode() and
is a well-known feature. It can be worked around by using UTF-16LE or
UTF-16BE, which do not prepend a BOM, but then the file has no BOM at
all. A complete solution would be to rewrite ElementTree.write() to use
a different encoding mechanism, such as a StreamWriter.
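
The BOM behaviour in point 2 can be reproduced outside ElementTree. The
following is a minimal sketch, assuming a current Python 3 interpreter
(this report was filed against an older release); the fragment list
`pieces` is made up for illustration and is not taken from the
ElementTree source.

```python
import codecs
import io

# Hypothetical fragments, standing in for the separate writes a serializer makes.
pieces = ["<root>", "text", "</root>"]

# Encoding each fragment on its own prepends a BOM to every fragment.
per_piece = b"".join(p.encode("utf-16") for p in pieces)
bom_count_per_piece = per_piece.count(codecs.BOM_UTF16)

# The explicit-endianness codecs never write a BOM, so the output has none.
no_bom = b"".join(p.encode("utf-16-le") for p in pieces)
has_bom_le = no_bom.startswith(codecs.BOM_UTF16_LE)

# A StreamWriter keeps state between writes and emits the BOM only once.
buffer = io.BytesIO()
writer = codecs.getwriter("utf-16")(buffer)
for p in pieces:
    writer.write(p)
streamed = buffer.getvalue()
bom_count_streamed = streamed.count(codecs.BOM_UTF16)

print(bom_count_per_piece, has_bom_le, bom_count_streamed)
```

Only the streamed output decodes back to the original text with a single
leading BOM, which is why a stateful writer is suggested above.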

I have made the above hack and work-around for my own use, and I can
report that it produces perfect UTF-16.
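
For illustration only, here is one way to get the same effect without
patching ElementTree.write(). It is not the hack described above, which
is not shown in this message. It is a sketch assuming Python 3, where
ET.tostring(..., encoding="unicode") returns text; the helper name
to_utf16_bytes is hypothetical. The whole document, header and markup
included, is built as text and encoded exactly once, so every character
is UTF-16 and there is a single BOM.

```python
import codecs
import xml.etree.ElementTree as ET


def to_utf16_bytes(root):
    """Hypothetical helper: serialize to text, then encode a single time."""
    # encoding="unicode" yields a str with no XML declaration.
    body = ET.tostring(root, encoding="unicode")
    document = '<?xml version="1.0" encoding="UTF-16"?>\n' + body
    # One encode call for the full document means one BOM, at the start.
    return document.encode("utf-16")


root = ET.Element("greeting")
root.text = "h\u00e9llo"
data = to_utf16_bytes(root)

with open("greeting.xml", "wb") as f:  # example output path
    f.write(data)
```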
