utf-16 BOM is not skipped after seek(0) #49112
First write a utf-16 file with its signature:

>>> f1 = open('utf16.txt', 'w', encoding='utf-16')
>>> f1.write('0123456789')
>>> f1.close()

Then read it twice:

>>> f2 = open('utf16.txt', 'r', encoding='utf-16')
>>> print('read1', ascii(f2.read()))
read1 '0123456789'
>>> f2.seek(0)
0
>>> print('read2', ascii(f2.read()))
read2 '\ufeff0123456789'

The second read returns the BOM!

Maybe a suggestion: handle seek(0) as a special value which calls decoder.reset().
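For anyone who wants to check an interpreter for this behaviour, the reproduction can be wrapped in a small self-contained script. This is only a sketch: the helper name `read_twice` and the use of a temporary file are choices made for this example, not part of the report.

```python
import os
import tempfile


def read_twice(encoding="utf-16"):
    """Write a BOM-prefixed file, read it, seek(0), and read it again."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write("0123456789")
        with open(path, "r", encoding=encoding) as f:
            first = f.read()
            f.seek(0)
            second = f.read()
        return first, second
    finally:
        os.remove(path)


first, second = read_twice()
print("read1", ascii(first))
print("read2", ascii(second))
# On an affected interpreter the second read starts with '\ufeff'.
print("BOM leaked after seek(0):", second.startswith("\ufeff"))
```

On an interpreter that includes the fix, both reads should return the same string and the last line should report False.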
The problem is maybe that TextIOWrapper._pack_cookie() can create a
Why is the cookie an integer and not an object with attributes?
But only when position==0.
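To illustrate why an integer cookie makes position 0 ambiguous, here is a toy model of packing a byte position and a decoder state into one integer. The function names, the 64-bit field width, and the value 2 for "endianness not yet determined" are assumptions made for this sketch; it is not the actual TextIOWrapper._pack_cookie() code.

```python
# Toy model only: hypothetical helpers, not the real TextIOWrapper code.
POSITION_BITS = 64  # assumed width of the position field

NATIVE_ORDER = 0    # decoder state: endianness determined, native order
UNDETERMINED = 2    # assumed value for "BOM not seen yet"


def pack_cookie(position, dec_flags=0):
    """Pack a byte position and decoder flags into a single integer."""
    return position | (dec_flags << POSITION_BITS)


def unpack_cookie(cookie):
    """Split a cookie back into (position, dec_flags)."""
    position = cookie & ((1 << POSITION_BITS) - 1)
    dec_flags = cookie >> POSITION_BITS
    return position, dec_flags


# A plain seek(0) passes the integer 0, which unpacks to position 0 with
# decoder state 0, i.e. "endianness already known", so the BOM is not skipped.
plain_zero = unpack_cookie(0)

# A cookie that really means "start of file, BOM not seen yet" is not 0.
start_cookie = pack_cookie(0, UNDETERMINED)

print(plain_zero, start_cookie != 0)
```

In this model, a bare 0 cannot be told apart from "position 0 with the decoder already in native order", which is why special-casing position 0 (or using an object with named attributes) removes the ambiguity.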
Well, there are other problems with utf-16, e.g. when opening an

>>> f = open('utf16.txt', 'w', encoding='utf-16')
>>> f.write('abc')
3
>>> f.close()
>>> f = open('utf16.txt', 'a', encoding='utf-16')
>>> f.write('def')
3
>>> f.close()
>>> open('utf16.txt', 'r', encoding='utf-16').read()
'abc\ufeffdef'

Who said TextIOWrapper was sane? :-o
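The append-mode case can be checked the same way. The sketch below (the helper name `append_roundtrip` is made up for this example) writes, appends, and then counts stray U+FEFF characters in the decoded text.

```python
import codecs
import os
import tempfile


def append_roundtrip(encoding="utf-16"):
    """Write 'abc', append 'def', and return (decoded text, raw bytes)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write("abc")
        with open(path, "a", encoding=encoding) as f:
            f.write("def")
        with open(path, "r", encoding=encoding) as f:
            text = f.read()
        with open(path, "rb") as f:
            raw = f.read()
        return text, raw
    finally:
        os.remove(path)


text, raw = append_roundtrip()
# The leading BOM is consumed by the decoder; any U+FEFF left in the
# decoded text was written a second time by the append.
stray_boms = text.count("\ufeff")
print(ascii(text), "stray BOMs:", stray_boms)
print("file starts with a BOM:", raw.startswith(codecs.BOM_UTF16))
```

On an affected interpreter this reports one stray BOM between 'abc' and 'def'; where the append problem has been fixed, the count should be zero.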
On 2009-01-07 01:21, Amaury Forgeot d'Arc wrote:
> First write a utf-16 file with its signature:
>
> >>> f1 = open('utf16.txt', 'w', encoding='utf-16')
> >>> f1.write('0123456789')
> >>> f1.close()
>
> Then read it twice:
>
> >>> f2 = open('utf16.txt', 'r', encoding='utf-16')
> >>> print('read1', ascii(f2.read()))
> read1 '0123456789'
> >>> f2.seek(0)
> 0
> >>> print('read2', ascii(f2.read()))
> read2 '\ufeff0123456789'
>
> The second read returns the BOM!
> This is because the zero in seek(0) is a "cookie" which contains both the position
> and the decoder state. Unfortunately, state=0 means 'endianness has been determined:
> native order'.
>
> maybe a suggestion: handle seek(0) as a special value which calls decoder.reset().
> The patch implements this idea.

This is a problem with the utf_16.py codec, not the io layer. Using .reset() will not help. The code for the StreamReader

Note that there's also the case .seek(1) - I guess this must
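To make the decoder.reset() suggestion concrete, here is a sketch that uses the incremental utf-16 decoder from the codecs module directly, without any file. It assumes the incremental decoder is the relevant object; the StreamReader class mentioned above is a separate code path and is not exercised here.

```python
import codecs

# BOM followed by the text in native byte order.
data = "0123456789".encode("utf-16")

decoder = codecs.getincrementaldecoder("utf-16")()
state_fresh = decoder.getstate()   # before any input has been seen

read1 = decoder.decode(data, final=True)
state_after = decoder.getstate()   # endianness is now known

# Feeding the same bytes again without a reset: the decoder already knows
# the byte order, so the BOM is decoded as an ordinary U+FEFF character.
read2_no_reset = decoder.decode(data, final=True)

# After reset() the decoder forgets the byte order and consumes the BOM again.
decoder.reset()
read2_after_reset = decoder.decode(data, final=True)

print(ascii(read1))
print(ascii(read2_no_reset))
print(ascii(read2_after_reset))
print(state_fresh, state_after)
```

The second element of the getstate() tuple is the decoder state discussed in the report: it differs before and after the BOM has been seen, and reset() returns the decoder to the initial value.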
I support Amaury's suggestion (actually I implemented it in the io-c
(and, you're right, opening in append mode is a different problem...)
I opened a different issue (bpo-5006) for the duplicate BOM in append
This has been fixed by the io-c branch merge.
Can you at least include the patch to test_io.py from Amaury's patch?
And why not fix the Python version of the io module (I'm not sure
Ah, I forgot this wasn't applied to the Python implementation. Fixed in
@benjamin: ok, great.