classification
Title: Duplicate UTF-16 BOM if a file is open in append mode
Type: Stage:
Components: Versions: Python 3.0, Python 3.1, Python 2.7, Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Conrad.Irwin, amaury.forgeotdarc, benjamin.peterson, ezio.melotti, flox, lemburg, pitrou, vstinner
Priority: normal Keywords: patch

Created on 2009-01-19 23:24 by vstinner, last changed 2010-07-28 10:18 by flox. This issue is now closed.

Files
File name Uploaded Description Edit
append_bom-2.patch vstinner, 2009-03-26 17:52
append_bom-3.patch pitrou, 2009-03-28 18:53
append_bom-4.patch pitrou, 2009-05-10 22:52
Messages (19)
msg80221 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-01-19 23:24
Copy/paste of message79330 from the issue #4862:
--------------
>>> f = open('utf16.txt', 'w', encoding='utf-16')
>>> f.write('abc')
3
>>> f.close()

>>> f = open('utf16.txt', 'a', encoding='utf-16')
>>> f.write('def')
3
>>> f.close()
>>> open('utf16.txt', 'r', encoding='utf-16').read()
'abc\ufeffdef'
--------------
msg80223 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-01-19 23:37
Bug is reproductible with:
 * Python 2.5 : charset utf-8-sig and utf-16 for codecs.open()
 * trunk : charset utf-8-sig, utf-16 and utf-32 for codecs.open()
 * py3k : charset utf-8-sig, utf-16 and utf-32 for open()

With utf-7 or utf-8, no BOM is written.

Note: with UTF-32, the BOM is 4 bytes long (0xff 0xfe 0x00 0x00 on 
little endian) but it's still the character (BOM) \ufeff (little 
endian).
msg80303 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-01-21 00:41
See also issues #5008 (f.tell()) and #5016 (f.seekable()), not 
directly related to this issue.
msg83554 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-14 00:49
One possible solution would be to call tell() in the TextIOWrapper
constructor, and then decoder.setstate((b"", 0)) if the current pos is >0.
msg83834 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-20 00:37
@pitrou: You're right, but the state have to be changed for the 
encoder, not the decoder. I added the following code to TextIOWrapper 
constructor (for the C and the Python version of io library):

        if self._seekable and self.writable():
            position = self.buffer.tell()
            if position != 0:
                self._encoder = self._get_encoder()
                self._encoder.setstate(0)
msg84185 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-26 16:55
As for the C version of the patch:
- I'm not sure why you make self->ok handling more complicated than it was
- encodefunc should not be forced to NULL, otherwise it will yield a big
decrease in write() performance
msg84188 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-26 17:26
> - I'm not sure why you make self->ok handling more complicated than it was

tell() requires self->ok=1. I choosed to reset self->ok to zero on error, but 
it's maybe useless.

> - encodefunc should not be forced to NULL, otherwise it will yield a big
> decrease in write() performance

self>encodefunc value have to be changed because <encoder>.setstate() changes 
the encoding function: UTF-16 => UTF-16-LE or UTF-16-BE.

I don't know how to get the new value of self>encodefunc. I will try to write 
another patch.
msg84193 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-26 17:41
Le jeudi 26 mars 2009 à 17:26 +0000, STINNER Victor a écrit :
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
> > - I'm not sure why you make self->ok handling more complicated than it was
> 
> tell() requires self->ok=1. I choosed to reset self->ok to zero on error, but 
> it's maybe useless.

You are calling tell() on the underlying binary buffer, not on the
TextIOWrapper object. So self->ok should have no impact. Am I missing
something?

> self>encodefunc value have to be changed because <encoder>.setstate() changes 
> the encoding function: UTF-16 => UTF-16-LE or UTF-16-BE.

I know, but then the logic should probably be changed and use an
additional struct member to signal that the start of file has been
skipped.
msg84194 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-26 17:52
Faster patch!
 - add fast encoder for UTF-32, UTF-32-LE and UTF-32-BE (copy/paste of 
utf16 functions)
 - move utf-8 before utf-16-* because utf8 more popular than utf16 :-p
 - don't set self->encodefunc=NULL (loose all the encoder 
optimisation), but only fix self->encodefunc for two special cases: 
utf16=>utf16le or utf16be and utf32=>utf32le or utf32be
 - remove self->ok: it was may be usefull for an older version of my 
patch, but it's not more needed
msg84195 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-26 17:57
Hum, it's a detail, but is it a good idea to keep the reference the int(0)? Or 
would it be better to release it at exit?
msg84319 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-28 18:07
It's better now, although I think it's not good to duplicate the
encoding switch logic. It would be better to have a separate flag
indicate whether it's the start of stream or not. I'm gonna produce a
new patch, unless you beat me to it.

Also, I'm adding Amaury to the nosy list so that he tells us whether he
thinks the approach is sound.
msg84321 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-28 18:53
Here is a new patch catering to more cases (seek()) in addition to just
opening in append mode.
msg84373 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-29 10:58
seek() has also the problem? It's really hard to encode UTF-16/32
correctly...
msg87554 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-05-10 22:52
Updated patch against py3k.
msg87748 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-05-14 17:50
If no-one objects, I'm going to commit this in a couple of days.
msg87753 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-05-14 18:43
As I said on IRC a few days ago, I think the patch is ready to go.
msg87754 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-05-14 18:56
Ok, committed in r72635.
msg109014 - (view) Author: Conrad.Irwin (Conrad.Irwin) Date: 2010-06-30 20:20
Shouldn't this fix be back-ported to the 2.6 branch too?
msg111763 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-07-28 02:00
I fixed #6213 in 2.6 and 2.7, and so it's now possible to backport this fix to 2.6 => r83200
History
Date User Action Args
2010-07-28 10:18:37floxsetnosy: + flox
2010-07-28 02:00:02vstinnersetmessages: + msg111763
2010-06-30 20:20:25Conrad.Irwinsetnosy: + Conrad.Irwin
messages: + msg109014
2009-05-14 18:56:23pitrousetstatus: open -> closed
resolution: fixed
messages: + msg87754
2009-05-14 18:43:09benjamin.petersonsetnosy: + benjamin.peterson
messages: + msg87753
2009-05-14 17:52:45ezio.melottisetnosy: + ezio.melotti
2009-05-14 17:50:50pitrousetmessages: + msg87748
2009-05-10 22:52:27pitrousetfiles: + append_bom-4.patch

messages: + msg87554
2009-05-01 21:50:41pitrousetnosy: + lemburg
2009-03-29 10:58:08vstinnersetmessages: + msg84373
2009-03-28 18:53:10pitrousetfiles: + append_bom-3.patch

messages: + msg84321
2009-03-28 18:07:14pitrousetnosy: + amaury.forgeotdarc
messages: + msg84319
2009-03-26 17:57:49vstinnersetmessages: + msg84195
2009-03-26 17:52:52vstinnersetfiles: - append_bom.patch
2009-03-26 17:52:40vstinnersetfiles: + append_bom-2.patch

messages: + msg84194
2009-03-26 17:41:34pitrousetmessages: + msg84193
2009-03-26 17:26:28vstinnersetmessages: + msg84188
2009-03-26 16:55:37pitrousetmessages: + msg84185
2009-03-20 00:37:44vstinnersetfiles: + append_bom.patch
keywords: + patch
messages: + msg83834
2009-03-14 00:49:57pitrousetmessages: + msg83554
2009-02-28 17:25:56benjamin.petersonlinkissue4565 dependencies
2009-01-21 00:41:46vstinnersetmessages: + msg80303
2009-01-20 00:12:20pitrousetnosy: + pitrou
2009-01-19 23:37:22vstinnersetmessages: + msg80223
versions: + Python 2.6, Python 2.7
2009-01-19 23:24:35vstinnercreate