Issue 4964: UTF-16 stream codec barfs on valid input


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/49214

classification

Title:	UTF-16 stream codec barfs on valid input
Type:	behavior	Stage:
Components:	Library (Lib)	Versions:	Python 3.0

process

Status:	closed	Resolution:	out of date
Dependencies:		Superseder:
Assigned To:		Nosy List:	gvanrossum
Priority:	critical	Keywords:

Created on 2009-01-16 21:26 by gvanrossum, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
contacts.csv	gvanrossum, 2009-01-16 21:26	UTF-16 with BOM

Messages (2)
msg79976 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2009-01-16 21:26
I am attaching a file encoded in UTF-16 (with BOM) which causes the stream codec employed by the file reader to barf when reading by lines. However, reading the file in binary mode and decoding it in one fell swoop works fine, and reading the whole text file with read() also works fine; so I believe the data in the file is not corrupt (it started out as an export of my Gmail contacts, but I x-ed out all printable ASCII characters).

>>> x = open('contacts.csv', 'rb').read().decode('utf16')  # OK
>>> x = open('contacts.csv', encoding='utf16').read()  # OK
>>> x = open('contacts.csv', encoding='utf16').readlines()  # Dies
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.0/io.py", line 534, in readlines
    return list(self)
  File "/usr/local/lib/python3.0/io.py", line 1739, in __next__
    line = self.readline()
  File "/usr/local/lib/python3.0/io.py", line 1813, in readline
    while self._read_chunk():
  File "/usr/local/lib/python3.0/io.py", line 1562, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/usr/local/lib/python3.0/io.py", line 1295, in decode
    output = self.decoder.decode(input, final=final)
  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
    return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 0: truncated data
>>>

Making certain modifications to the file elicits slightly different error messages (e.g. "'utf16' codec can't decode bytes in position 90-91: illegal encoding" when I swap the first and second halves of the file), so it looks like some kind of data corruption in the codec's state management or in the code in io.py that feeds the codec its data.
msg79977 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2009-01-16 21:35
Dang. Already fixed in trunk. (Is it fixed in 3.0.1 too?)
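
To check whether a given interpreter shows the behaviour described in msg79976, the three read paths can be compared on in-memory data. This is only a minimal sketch under stated assumptions: the sample text is made up (it is not the attached contacts.csv), the 3-byte chunk size is an arbitrary choice meant to split UTF-16 code units across chunk boundaries, and nothing here claims to reproduce the exact chunking io.py used in 3.0.

```python
import codecs
import io

# Hypothetical sample data standing in for contacts.csv; includes a non-ASCII
# character and a non-BMP character (encoded as a surrogate pair in UTF-16).
text = "name,email\nZo\u00eb,zoe@example.org\nBob \U0001F600,bob@example.org\n"
data = text.encode("utf-16")  # the 'utf-16' encoder prepends a BOM

# 1. Decode everything in one go (the "binary mode + decode" path).
whole = data.decode("utf-16")

# 2. Feed an incremental decoder deliberately awkward 3-byte chunks, so that
#    2-byte code units (and the 4-byte surrogate pair) straddle chunk edges.
decoder = codecs.getincrementaldecoder("utf-16")()
pieces = []
for start in range(0, len(data), 3):
    chunk = data[start:start + 3]
    is_last = start + 3 >= len(data)
    pieces.append(decoder.decode(chunk, final=is_last))
chunked = "".join(pieces)

# 3. Read line by line through the text layer (the path that failed in 3.0).
lines = io.TextIOWrapper(io.BytesIO(data), encoding="utf-16").readlines()

print(whole == text, chunked == text, "".join(lines) == text)
```

If any of the three comparisons is False, or the loop raises UnicodeDecodeError, the decoder is losing state between chunks, which is the kind of failure reported above.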

History
Date	User	Action	Args
2022-04-11 14:56:44	admin	set	github: 49214
2009-01-16 21:35:13	gvanrossum	set	status: open -> closed resolution: out of date messages: + msg79977
2009-01-16 21:26:08	gvanrossum	create