Message 79410 - Python tracker


Author	lemburg
Recipients	amaury.forgeotdarc, lemburg, pitrou, vstinner
Date	2009-01-08.13:50:49
Message-id	<496604B8.6070900@egenix.com>
In-reply-to	<1231287677.41.0.817743984337.issue4862@psf.upfronthosting.co.za>

Content
On 2009-01-07 01:21, Amaury Forgeot d'Arc wrote:
> First write a utf-16 file with its signature:
> 
> >>> f1 = open('utf16.txt', 'w', encoding='utf-16')
> >>> f1.write('0123456789')
> >>> f1.close()
> 
> Then read it twice:
> 
> >>> f2 = open('utf16.txt', 'r', encoding='utf-16')
> >>> print('read1', ascii(f2.read()))
> read1 '0123456789'
> >>> f2.seek(0)
> 0
> >>> print('read2', ascii(f2.read()))
> read2 '\ufeff0123456789'
> 
> The second read returns the BOM!
> This is because the zero in seek(0) is a "cookie" which contains both the position 
> and the decoder state. Unfortunately, state=0 means 'endianness has been determined: 
> native order'.
> 
> A suggestion: handle seek(0) as a special value which calls decoder.reset().
> The patch implements this idea.
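
The mechanism described in the quote can be reproduced without a file.
This is only a sketch: it assumes the incremental UTF-16 decoder in
CPython reports its byte-order flag as the second item of getstate(),
with 0 meaning "determined, native order" as stated above. The variable
names are made up for illustration.

```python
import codecs

# The generic 'utf-16' codec writes a BOM followed by native-order data.
data = '0123456789'.encode('utf-16')

# A fresh decoder has not seen the BOM yet.
fresh = codecs.getincrementaldecoder('utf-16')()
state_before = fresh.getstate()

# The first pass consumes the BOM and fixes the byte order.
first_read = fresh.decode(data)
state_after = fresh.getstate()

# Restore the "byte order already determined" state on a new decoder and
# decode from byte 0 again. Assumption: this mirrors what a seek(0) whose
# cookie carries state 0 amounts to.
rewound = codecs.getincrementaldecoder('utf-16')()
rewound.setstate(state_after)
second_read = rewound.decode(data)

print(state_before, state_after)
print(ascii(first_read), ascii(second_read))
```

With the state restored this way the decoder no longer looks for a BOM,
so the BOM bytes are decoded as an ordinary character.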

This is a problem with the utf_16.py codec, not the io layer.
Opening a file in append mode is something that the io layer
would have to handle, since the codec doesn't know anything about
the underlying file mode.

Using .reset() will not help. The StreamReader and StreamWriter
code in utf_16.py will have to be modified to undo the adjustment
of the .encode() and .decode() methods after .seek(0) has been
used.
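
To make the idea concrete, here is a rough sketch for the reader side
only. It is not the code in utf_16.py and not a proposed patch: the
class name SketchUTF16Reader is hypothetical, and it assumes the
"adjustment" is an instance attribute that shadows .decode() once the
BOM has been seen. It relies on codecs.utf_16_ex_decode, which is not
part of the documented API.

```python
import codecs
import io


class SketchUTF16Reader(codecs.StreamReader):
    """Hypothetical reader that forgets the byte order on seek(0)."""

    def decode(self, input, errors='strict'):
        # byteorder=0 asks the decoder to detect the BOM itself.
        output, consumed, byteorder = codecs.utf_16_ex_decode(
            input, errors, 0, False)
        if byteorder == -1:
            # Little-endian BOM seen: shadow decode() on this instance.
            self.decode = codecs.utf_16_le_decode
        elif byteorder == 1:
            # Big-endian BOM seen.
            self.decode = codecs.utf_16_be_decode
        elif consumed >= 2:
            raise UnicodeError("UTF-16 stream does not start with BOM")
        return output, consumed

    def seek(self, offset, whence=0):
        self.stream.seek(offset, whence)
        # Drop buffered bytes and characters.
        codecs.StreamReader.reset(self)
        if offset == 0 and whence == 0:
            # Undo the adjustment so the BOM is detected again.
            self.__dict__.pop('decode', None)


raw = io.BytesIO('0123456789'.encode('utf-16'))
reader = SketchUTF16Reader(raw)
first = reader.read()
reader.seek(0)
second = reader.read()
print(ascii(first), ascii(second))
```

A StreamWriter would need the mirror image: after .seek(0) the next
.encode() call has to emit the BOM again.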

Note that there is also the .seek(1) case. I guess this has to be
treated as resulting in undefined behavior.
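
One plausible reason, assuming the offset is taken as a raw byte
position: UTF-16 code units are two bytes wide, so an odd offset splits
every unit. The sketch below uses the explicit 'utf-16-le' codec (no
BOM) to keep it independent of the platform byte order; the names are
illustrative only.

```python
text = '0123'
payload = text.encode('utf-16-le')   # two bytes per character, no BOM

# Start decoding one byte into the stream, as a byte-level seek(1) would.
# errors='replace' keeps the dangling final byte from raising an error.
shifted = payload[1:].decode('utf-16-le', errors='replace')

print(ascii(text), ascii(shifted))
```

The shifted result pairs the high byte of one character with the low
byte of the next, so none of the original characters survive.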

History
Date	User	Action	Args
2009-01-08 13:50:51	lemburg	set	recipients: + lemburg, amaury.forgeotdarc, pitrou, vstinner
2009-01-08 13:50:50	lemburg	link	issue4862 messages
2009-01-08 13:50:49	lemburg	create