utf-16 BOM is not skipped after seek(0) #49112

amauryfa · 2009-01-07T00:21:17Z

BPO	4862
Nosy	@malemburg, @amauryfa, @pitrou, @vstinner, @benjaminp
Files	io_utf16.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2009-03-04.21:36:57.479>
created_at = <Date 2009-01-07.00:21:16.662>
labels = []
title = 'utf-16 BOM is not skipped after seek(0)'
updated_at = <Date 2009-03-05.01:00:49.629>
user = 'https://github.com/amauryfa'

bugs.python.org fields:

activity = <Date 2009-03-05.01:00:49.629>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2009-03-04.21:36:57.479>
closer = 'benjamin.peterson'
components = []
creation = <Date 2009-01-07.00:21:16.662>
creator = 'amaury.forgeotdarc'
dependencies = []
files = ['12627']
hgrepos = []
issue_num = 4862
keywords = ['patch']
message_count = 11.0
messages = ['79299', '79325', '79326', '79330', '79410', '80211', '80222', '83141', '83167', '83173', '83176']
nosy_count = 5.0
nosy_names = ['lemburg', 'amaury.forgeotdarc', 'pitrou', 'vstinner', 'benjamin.peterson']
pr_nums = []
priority = 'critical'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue4862'
versions = ['Python 3.0']

amauryfa · 2009-01-07T00:21:16Z

First write a utf-16 file with its signature:

>> f1 = open('utf16.txt', 'w', encoding='utf-16')
>> f1.write('0123456789')
>> f1.close()

Then read it twice:

>>> f2 = open('utf16.txt', 'r', encoding='utf-16')
>>> print('read1', ascii(f2.read()))
read1 '0123456789'
>>> f2.seek(0)
0
>>> print('read2', ascii(f2.read()))
read2 '\ufeff0123456789'

The second read returns the BOM!
This is because the zero in seek(0) is a "cookie" which contains both the position
and the decoder state. Unfortunately, state=0 means 'endianness has been determined:
native order'.

maybe a suggestion: handle seek(0) as a special value which calls decoder.reset().
The patch implement this idea.

vstinner · 2009-01-07T11:46:49Z

This is because the zero in seek(0) is a "cookie"
which contains both the position and the decoder state.
Unfortunately, state=0 means 'endianness has been determined:
native order'.

The problem is maybe that TextIOWrapper._pack_cookie() can create a
cookie=0. Example to create a non-null value, replace:
def _pack_cookie(self, position, ...):
return (position | (dec_flags<<64) | ...
def _unpack_cookie(self, bigint):
rest, position = divmod(bigint, 1<<64)
...
by
def _pack_cookie(self, position, ...):
return (1 | (position<<1) | (dec_flags<<65) | ...
def _unpack_cookie(self, bigint):
if not (bigint & 1):
raise ValueError("invalid cookie")
bigint >>= 1
rest, position = divmod(bigint, 1<<64)
...

Why the cookie is an integer and not an object with attributes?

amauryfa · 2009-01-07T12:03:28Z

The problem is maybe that TextIOWrapper._pack_cookie()
can create a cookie=0

But only when position==0.
And in this case, at the beginning of the stream, it makes sense to
reset everything to its initial value: zero for the various counts, and
call decoder.reset()

pitrou · 2009-01-07T12:59:00Z

Well, there are other problems with utf-16, e.g. when opening an
existing file for appending, the BOM is written again:

>>> f = open('utf16.txt', 'w', encoding='utf-16')
>>> f.write('abc')
3
>>> f.close()
>>> f = open('utf16.txt', 'a', encoding='utf-16')
>>> f.write('def')
3
>>> f.close()
>>> open('utf16.txt', 'r', encoding='utf-16').read()
'abc\ufeffdef'

Who said TextIOWrapper was sane? :-o

malemburg · 2009-01-08T13:50:50Z

On 2009-01-07 01:21, Amaury Forgeot d'Arc wrote:
> First write a utf-16 file with its signature:
> 
>>>> f1 = open('utf16.txt', 'w', encoding='utf-16')
>>>> f1.write('0123456789')
>>>> f1.close()
> 
> Then read it twice:
> 
>>>> f2 = open('utf16.txt', 'r', encoding='utf-16')
>>>> print('read1', ascii(f2.read()))
> read1 '0123456789'
>>>> f2.seek(0)
> 0
>>>> print('read2', ascii(f2.read()))
> read2 '\ufeff0123456789'
> 
> The second read returns the BOM!
> This is because the zero in seek(0) is a "cookie" which contains both the position 
> and the decoder state. Unfortunately, state=0 means 'endianness has been determined: 
> native order'.
> 
> maybe a suggestion: handle seek(0) as a special value which calls decoder.reset().
> The patch implement this idea.

This is a problem with the utf_16.py codec, not the io layer.
Opening a file in append mode is something that the io layer
would have to handle, since the codec doesn't know anything about
the underlying file mode.

Using .reset() will not help. The code for the StreamReader
and StreamWriter in utf_16.py will have to be modified to undo
the adjustment of the .encode() and .decode() method after using
.seek(0).

Note that there's also the case .seek(1) - I guess this must
be considered as resulting in undefined behavior.

pitrou · 2009-01-19T21:27:41Z

I support Amaury's suggestion (actually I implemented it in the io-c
branch). Resetting the decoder when seeking to the beginning of the
stream is a reasonable way to deal with those incremental decoders for
which the start state is something else than (b"", 0).

(and, you're right, opening in append mode is a different problem...)

vstinner · 2009-01-19T23:25:00Z

I opened a different issue (bpo-5006) for the duplicate BOM in append
mode.

benjaminp · 2009-03-04T21:36:57Z

This has been fixed by the io-c branch merge.

vstinner · 2009-03-05T00:24:46Z

This has been fixed by the io-c branch merge.

Can you at least include the patch to test_io.py from amaury's patch?

And why not fixing the Python version of the io module (i'm not sure
of the new name: _pyio?) since we have a working patch?

benjaminp · 2009-03-05T00:42:45Z

Ah, I forgot this wasn't applied to the Python implementation. Fixed in
r70184.

vstinner · 2009-03-05T01:00:50Z

@benjamin: ok, great.

benjaminp closed this as completed Mar 4, 2009

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

utf-16 BOM is not skipped after seek(0) #49112

utf-16 BOM is not skipped after seek(0) #49112

amauryfa commented Jan 7, 2009

amauryfa commented Jan 7, 2009

vstinner commented Jan 7, 2009

amauryfa commented Jan 7, 2009

pitrou commented Jan 7, 2009

malemburg commented Jan 8, 2009

pitrou commented Jan 19, 2009

vstinner commented Jan 19, 2009

benjaminp commented Mar 4, 2009

vstinner commented Mar 5, 2009

benjaminp commented Mar 5, 2009

vstinner commented Mar 5, 2009

utf-16 BOM is not skipped after seek(0) #49112

utf-16 BOM is not skipped after seek(0) #49112

Comments

amauryfa commented Jan 7, 2009

amauryfa commented Jan 7, 2009

vstinner commented Jan 7, 2009

amauryfa commented Jan 7, 2009

pitrou commented Jan 7, 2009

malemburg commented Jan 8, 2009

pitrou commented Jan 19, 2009

vstinner commented Jan 19, 2009

benjaminp commented Mar 4, 2009

vstinner commented Mar 5, 2009

benjaminp commented Mar 5, 2009

vstinner commented Mar 5, 2009