Duplicate UTF-16 BOM if a file is open in append mode #49256

vstinner · 2009-01-19T23:24:35Z

BPO	5006
Nosy	@malemburg, @amauryfa, @pitrou, @vstinner, @benjaminp, @ezio-melotti, @florentx
Files	append_bom-2.patch append_bom-3.patch append_bom-4.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2009-05-14.18:56:23.019>
created_at = <Date 2009-01-19.23:24:35.284>
labels = []
title = 'Duplicate UTF-16 BOM if a file is open in append mode'
updated_at = <Date 2010-07-28.10:18:37.955>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2010-07-28.10:18:37.955>
actor = 'flox'
assignee = 'none'
closed = True
closed_date = <Date 2009-05-14.18:56:23.019>
closer = 'pitrou'
components = []
creation = <Date 2009-01-19.23:24:35.284>
creator = 'vstinner'
dependencies = []
files = ['13421', '13445', '13953']
hgrepos = []
issue_num = 5006
keywords = ['patch']
message_count = 19.0
messages = ['80221', '80223', '80303', '83554', '83834', '84185', '84188', '84193', '84194', '84195', '84319', '84321', '84373', '87554', '87748', '87753', '87754', '109014', '111763']
nosy_count = 8.0
nosy_names = ['lemburg', 'amaury.forgeotdarc', 'pitrou', 'vstinner', 'benjamin.peterson', 'ezio.melotti', 'Conrad.Irwin', 'flox']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue5006'
versions = ['Python 2.6', 'Python 3.0', 'Python 3.1', 'Python 2.7']

vstinner · 2009-01-19T23:24:35Z

Copy/paste of message79330 from the issue bpo-4862:
--------------

>>> f = open('utf16.txt', 'w', encoding='utf-16')
>>> f.write('abc')
3
>>> f.close()

>>> f = open('utf16.txt', 'a', encoding='utf-16')
>>> f.write('def')
3
>>> f.close()
>>> open('utf16.txt', 'r', encoding='utf-16').read()
'abc\ufeffdef'

vstinner · 2009-01-19T23:37:22Z

Bug is reproductible with:

Python 2.5 : charset utf-8-sig and utf-16 for codecs.open()
trunk : charset utf-8-sig, utf-16 and utf-32 for codecs.open()
py3k : charset utf-8-sig, utf-16 and utf-32 for open()

With utf-7 or utf-8, no BOM is written.

Note: with UTF-32, the BOM is 4 bytes long (0xff 0xfe 0x00 0x00 on
little endian) but it's still the character (BOM) \ufeff (little
endian).

vstinner · 2009-01-21T00:41:47Z

See also issues bpo-5008 (f.tell()) and bpo-5016 (f.seekable()), not
directly related to this issue.

pitrou · 2009-03-14T00:49:58Z

One possible solution would be to call tell() in the TextIOWrapper
constructor, and then decoder.setstate((b"", 0)) if the current pos is >0.

vstinner · 2009-03-20T00:37:43Z

@pitrou: You're right, but the state have to be changed for the
encoder, not the decoder. I added the following code to TextIOWrapper
constructor (for the C and the Python version of io library):

        if self._seekable and self.writable():
            position = self.buffer.tell()
            if position != 0:
                self._encoder = self._get_encoder()
                self._encoder.setstate(0)

pitrou · 2009-03-26T16:55:37Z

As for the C version of the patch:

I'm not sure why you make self->ok handling more complicated than it was
encodefunc should not be forced to NULL, otherwise it will yield a big
decrease in write() performance

vstinner · 2009-03-26T17:26:28Z

I'm not sure why you make self->ok handling more complicated than it was

tell() requires self->ok=1. I choosed to reset self->ok to zero on error, but
it's maybe useless.

encodefunc should not be forced to NULL, otherwise it will yield a big
decrease in write() performance

self>encodefunc value have to be changed because <encoder>.setstate() changes
the encoding function: UTF-16 => UTF-16-LE or UTF-16-BE.

I don't know how to get the new value of self>encodefunc. I will try to write
another patch.

pitrou · 2009-03-26T17:41:34Z

Le jeudi 26 mars 2009 à 17:26 +0000, STINNER Victor a écrit :

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

> - I'm not sure why you make self->ok handling more complicated than it was

tell() requires self->ok=1. I choosed to reset self->ok to zero on error, but
it's maybe useless.

You are calling tell() on the underlying binary buffer, not on the
TextIOWrapper object. So self->ok should have no impact. Am I missing
something?

self>encodefunc value have to be changed because <encoder>.setstate() changes
the encoding function: UTF-16 => UTF-16-LE or UTF-16-BE.

I know, but then the logic should probably be changed and use an
additional struct member to signal that the start of file has been
skipped.

vstinner · 2009-03-26T17:52:39Z

Faster patch!

add fast encoder for UTF-32, UTF-32-LE and UTF-32-BE (copy/paste of
utf16 functions)
move utf-8 before utf-16-* because utf8 more popular than utf16 :-p
don't set self->encodefunc=NULL (loose all the encoder
optimisation), but only fix self->encodefunc for two special cases:
utf16=>utf16le or utf16be and utf32=>utf32le or utf32be
remove self->ok: it was may be usefull for an older version of my
patch, but it's not more needed

vstinner · 2009-03-26T17:57:49Z

Hum, it's a detail, but is it a good idea to keep the reference the int(0)? Or
would it be better to release it at exit?

pitrou · 2009-03-28T18:07:14Z

It's better now, although I think it's not good to duplicate the
encoding switch logic. It would be better to have a separate flag
indicate whether it's the start of stream or not. I'm gonna produce a
new patch, unless you beat me to it.

Also, I'm adding Amaury to the nosy list so that he tells us whether he
thinks the approach is sound.

pitrou · 2009-03-28T18:53:04Z

Here is a new patch catering to more cases (seek()) in addition to just
opening in append mode.

vstinner · 2009-03-29T10:58:08Z

seek() has also the problem? It's really hard to encode UTF-16/32
correctly...

pitrou · 2009-05-10T22:52:24Z

Updated patch against py3k.

pitrou · 2009-05-14T17:50:50Z

If no-one objects, I'm going to commit this in a couple of days.

benjaminp · 2009-05-14T18:43:09Z

As I said on IRC a few days ago, I think the patch is ready to go.

pitrou · 2009-05-14T18:56:23Z

Ok, committed in r72635.

ConradIrwin · 2010-06-30T20:20:25Z

Shouldn't this fix be back-ported to the 2.6 branch too?

vstinner · 2010-07-28T02:00:02Z

I fixed bpo-6213 in 2.6 and 2.7, and so it's now possible to backport this fix to 2.6 => r83200

pitrou closed this as completed May 14, 2009

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Duplicate UTF-16 BOM if a file is open in append mode #49256

Duplicate UTF-16 BOM if a file is open in append mode #49256

vstinner commented Jan 19, 2009

vstinner commented Jan 19, 2009

vstinner commented Jan 19, 2009

vstinner commented Jan 21, 2009

pitrou commented Mar 14, 2009

vstinner commented Mar 20, 2009

pitrou commented Mar 26, 2009

vstinner commented Mar 26, 2009

pitrou commented Mar 26, 2009

vstinner commented Mar 26, 2009

vstinner commented Mar 26, 2009

pitrou commented Mar 28, 2009

pitrou commented Mar 28, 2009

vstinner commented Mar 29, 2009

pitrou commented May 10, 2009

pitrou commented May 14, 2009

benjaminp commented May 14, 2009

pitrou commented May 14, 2009

ConradIrwin mannequin commented Jun 30, 2010

vstinner commented Jul 28, 2010

Duplicate UTF-16 BOM if a file is open in append mode #49256

Duplicate UTF-16 BOM if a file is open in append mode #49256

Comments

vstinner commented Jan 19, 2009

vstinner commented Jan 19, 2009

vstinner commented Jan 19, 2009

vstinner commented Jan 21, 2009

pitrou commented Mar 14, 2009

vstinner commented Mar 20, 2009

pitrou commented Mar 26, 2009

vstinner commented Mar 26, 2009

pitrou commented Mar 26, 2009

vstinner commented Mar 26, 2009

vstinner commented Mar 26, 2009

pitrou commented Mar 28, 2009

pitrou commented Mar 28, 2009

vstinner commented Mar 29, 2009

pitrou commented May 10, 2009

pitrou commented May 14, 2009

benjaminp commented May 14, 2009

pitrou commented May 14, 2009

ConradIrwin mannequin commented Jun 30, 2010

vstinner commented Jul 28, 2010