
classification
Title: UTF-16 stream codec barfs on valid input
Type: behavior
Stage:
Components: Library (Lib)
Versions: Python 3.0

process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To:
Nosy List: gvanrossum
Priority: critical
Keywords:

Created on 2009-01-16 21:26 by gvanrossum, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name: contacts.csv
Uploaded: gvanrossum, 2009-01-16 21:26
Description: UTF-16 with BOM
Messages (2)
msg79976 - Author: Guido van Rossum (gvanrossum) (Python committer) Date: 2009-01-16 21:26
I am attaching a file encoded in UTF-16 (with BOM) which causes the
stream codec employed by the file reader to barf when reading by lines.
However, reading the file in binary mode and decoding it in one fell
swoop works fine, and reading the whole text file with read() also works
fine; so I believe the data in the file is not corrupt (it started out
as an export of my Gmail contacts, but I x-ed out all printable ASCII
characters).

>>> x = open('contacts.csv', 'rb').read().decode('utf16')  # OK
>>> x = open('contacts.csv', encoding='utf16').read()  # OK
>>> x = open('contacts.csv', encoding='utf16').readlines()  # Dies
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.0/io.py", line 534, in readlines
    return list(self)
  File "/usr/local/lib/python3.0/io.py", line 1739, in __next__
    line = self.readline()
  File "/usr/local/lib/python3.0/io.py", line 1813, in readline
    while self._read_chunk():
  File "/usr/local/lib/python3.0/io.py", line 1562, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/usr/local/lib/python3.0/io.py", line 1295, in decode
    output = self.decoder.decode(input, final=final)
  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
    return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 0: truncated data
>>>

Making certain modifications to the file elicits slightly different
error messages (e.g. "'utf16' codec can't decode bytes in position
90-91: illegal encoding" when I swap the second and first half of the
file) so it looks like some kind of data corruption in the codec's state
management or in the code in io.py that feeds the codec its data.
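
The three access patterns above can be compared without the attached file. What follows is a minimal sketch, not a reproduction of the original failure: the sample text and the 7-byte chunk size are hypothetical stand-ins (not taken from contacts.csv), and the incremental loop only approximates how a buffered text reader feeds the codec. The odd chunk size deliberately splits 2-byte code units across calls, which is the situation the codec's state management has to handle. On an interpreter where the bug is fixed, all three paths should agree.

```python
import codecs
import io

# Hypothetical sample data standing in for contacts.csv.
text = "x,x,x\n" * 50
data = text.encode("utf-16")  # the "utf-16" codec prepends a BOM

# 1. Decode the whole byte string in one call.
one_shot = data.decode("utf-16")

# 2. Feed an incremental decoder in 7-byte chunks, so that code units
#    are split across chunk boundaries and the decoder must buffer them.
decoder = codecs.getincrementaldecoder("utf-16")()
pieces = []
for start in range(0, len(data), 7):
    chunk = data[start:start + 7]
    is_last = start + 7 >= len(data)
    pieces.append(decoder.decode(chunk, final=is_last))
incremental = "".join(pieces)

# 3. Read by lines through the text layer, as readlines() does.
reader = io.TextIOWrapper(io.BytesIO(data), encoding="utf-16", newline="")
lines = reader.readlines()

print(one_shot == incremental == "".join(lines))
```

If the last comparison is false, or step 2 or 3 raises UnicodeDecodeError while step 1 succeeds, the problem lies in the incremental path rather than in the data.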
msg79977 - Author: Guido van Rossum (gvanrossum) (Python committer) Date: 2009-01-16 21:35
Dang. Already fixed in trunk. (Is it fixed in 3.0.1 too?)
History
Date User Action Args
2022-04-11 14:56:44 admin set github: 49214
2009-01-16 21:35:13 gvanrossum set status: open -> closed; resolution: out of date; messages: + msg79977
2009-01-16 21:26:08 gvanrossum create