
Author lemburg
Recipients amaury.forgeotdarc, lemburg, pitrou, vstinner
Date 2009-01-08.13:50:49
Message-id <496604B8.6070900@egenix.com>
In-reply-to <1231287677.41.0.817743984337.issue4862@psf.upfronthosting.co.za>
Content
On 2009-01-07 01:21, Amaury Forgeot d'Arc wrote:
> First write a utf-16 file with its signature:
> 
>>>> f1 = open('utf16.txt', 'w', encoding='utf-16')
>>>> f1.write('0123456789')
>>>> f1.close()
> 
> Then read it twice:
> 
>>>> f2 = open('utf16.txt', 'r', encoding='utf-16')
>>>> print('read1', ascii(f2.read()))
> read1 '0123456789'
>>>> f2.seek(0)
> 0
>>>> print('read2', ascii(f2.read()))
> read2 '\ufeff0123456789'
> 
> The second read returns the BOM!
> This is because the zero in seek(0) is a "cookie" which contains both the position 
> and the decoder state. Unfortunately, state=0 means 'endianness has been determined: 
> native order'.
> 
> Maybe a suggestion: handle seek(0) as a special value which calls decoder.reset().
> The patch implements this idea.

This is a problem with the utf_16.py codec, not the io layer.
Opening a file in append mode is something that the io layer
would have to handle, since the codec doesn't know anything about
the underlying file mode.
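
To make the append-mode case concrete, here is a minimal sketch, assuming a current CPython 3 in which the io layer deals with the BOM when appending to a non-empty file. The scratch file name and the strings are made up for the illustration. It writes with 'w', appends with 'a', and then looks at the raw bytes.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "append_utf16.txt")

with open(path, "w", encoding="utf-16") as f:
    f.write("aaa")          # the codec emits a BOM, then the text

with open(path, "a", encoding="utf-16") as f:
    f.write("xxx")          # appended after the existing data

with open(path, "rb") as f:
    raw = f.read()

# Decode with the endianness-specific codec, which keeps any BOM
# in the text as U+FEFF, so the BOMs can simply be counted.
native = "utf-16-le" if raw[:2] == b"\xff\xfe" else "utf-16-be"
bom_count = raw.decode(native).count("\ufeff")
print(bom_count, raw == "aaaxxx".encode("utf-16"))
```

A count above one would mean that a second BOM ended up in the middle of the file, which is the situation the codec alone cannot prevent because it never sees the file mode.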

Using .reset() will not help. The code for the StreamReader
and StreamWriter in utf_16.py will have to be modified to undo
the adjustment of the .encode() and .decode() methods after using
.seek(0).
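
As a rough illustration of what "undo the adjustment" means, here is a sketch with a toy class. It is not the actual utf_16.py code: the class name BomSniffingDecoder and its rewind() method are hypothetical, and codecs.utf_16_ex_decode is an undocumented helper that I am assuming is available, as it is in CPython. The first decode() call detects the byte order from the BOM and then replaces itself on the instance with a fixed-endianness decoder. Rewinding has to remove that replacement again.

```python
import codecs


class BomSniffingDecoder:
    """Hypothetical stand-in for a stateful UTF-16 stream reader."""

    def decode(self, data, errors="strict"):
        # Let the codec inspect the BOM and report the byte order:
        # -1 means little endian, 1 means big endian.
        text, consumed, byteorder = codecs.utf_16_ex_decode(data, errors, 0, False)
        if byteorder == -1:
            # Shadow the class method with a fixed little-endian decoder.
            self.decode = codecs.utf_16_le_decode
        elif byteorder == 1:
            self.decode = codecs.utf_16_be_decode
        return text, consumed

    def rewind(self):
        # Undo the adjustment: drop the instance attribute so that the
        # BOM-sniffing class method is used again.
        self.__dict__.pop("decode", None)


data = "0123456789".encode("utf-16")   # BOM followed by the payload

reader = BomSniffingDecoder()
first = reader.decode(data)[0]    # BOM consumed, byte order now pinned
stale = reader.decode(data)[0]    # same bytes again: BOM leaks into the text
reader.rewind()
fresh = reader.decode(data)[0]    # BOM is recognized again

print(ascii(first), ascii(stale), ascii(fresh))
```

The middle call shows the same symptom as the second read in the report: once the method has been swapped, the BOM at the start of the data is decoded as an ordinary character.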

Note that there's also the case .seek(1) - I guess this must
be considered as resulting in undefined behavior.
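
For comparison, the supported way to return to a position other than the start is to pass back a value obtained from .tell(), since that value is an opaque cookie rather than a plain byte offset. A minimal sketch, assuming a current CPython 3 and a made-up scratch file:

```python
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "utf16.txt")

with open(path, "w", encoding="utf-16") as f:
    f.write("0123456789")

with open(path, "r", encoding="utf-16") as f:
    head = f.read(4)
    cookie = f.tell()        # opaque value: position plus decoder state
    tail = f.read()
    f.seek(cookie)           # supported: a value that tell() handed out
    tail_again = f.read()

print(head, tail, tail_again)
```

An arbitrary integer such as 1 was never produced by .tell() for this file, so nothing can be said about where the decoder ends up.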
History
Date                 User     Action  Args
2009-01-08 13:50:51  lemburg  set     recipients: + lemburg, amaury.forgeotdarc, pitrou, vstinner
2009-01-08 13:50:50  lemburg  link    issue4862 messages
2009-01-08 13:50:49  lemburg  create