classification
Title: UnicodeDecodeError when readline in codecs.py
Type: behavior Stage: resolved
Components: Unicode Versions: Python 3.2, Python 3.3, Python 3.4, Python 2.7
process
Status: closed Resolution: duplicate
Dependencies: Superseder: UTF-16 incremental decoder doesn't support partial surrogate pair
View: 11461
Assigned To: Nosy List: Marcus.Gröber, doerwalter, ezio.melotti, lovelylain, serhiy.storchaka, vstinner
Priority: normal Keywords: needs review, patch

Created on 2012-07-07 15:18 by lovelylain, last changed 2013-01-07 17:54 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
utf16_partial_decode-3.3.patch serhiy.storchaka, 2012-10-08 16:10 Patch for 3.3 and 3.4 review
utf16_partial_decode-3.2.patch serhiy.storchaka, 2012-10-08 16:10 Patch for 3.2 review
utf16_partial_decode-2.7.patch serhiy.storchaka, 2012-10-08 16:10 Patch for 2.7 review
Messages (12)
msg164869 - (view) Author: lovelylain (lovelylain) Date: 2012-07-07 15:18
This is an example, `for line in fp` will raise UnicodeDecodeError:
#! -*- coding: utf-8 -*-
import codecs

text = u'\u6731' + u'\U0002a6a5' * 18
print repr(text)

with codecs.open('test.txt', 'wb', 'utf-16-le') as fp:
    fp.write(text)

with codecs.open('test.txt', 'rb', 'utf-16-le') as fp:
    print repr(fp.read())

with codecs.open('test.txt', 'rb', 'utf-16-le') as fp:
    for line in fp:
        print repr(line)

I read code in codecs.py:
    def read(self, size=-1, chars=-1, firstline=False):

        """ Decodes data from the stream self.stream and returns the
            resulting object.
...
            If firstline is true, and a UnicodeDecodeError happens
            after the first line terminator in the input only the first line
            will be returned, the rest of the input will be kept until the
            next call to read().

        """
...
            try:
                newchars, decodedbytes = self.decode(data, self.errors)
            except UnicodeDecodeError, exc:
                if firstline:
                    newchars, decodedbytes = self.decode(data[:exc.start], self.errors)
                    lines = newchars.splitlines(True)
                    if len(lines)<=1:
                        raise
                else:
                    raise
...

It seems that the firstline argument is not consistent with its doc description.
I don't konw why this argument was added and why lines count was checked.
If it was added for readline function to fix some decode errors, we may have no EOLs in data readed, so it caused UnicodeDecodeError too.
Maybe we should write code like below to support codecs readline.

    def read(self, size=-1, chars=-1, autotruncate=False):
...
            try:
                newchars, decodedbytes = self.decode(data, self.errors)
            except UnicodeDecodeError, exc:
                if autotruncate and exc.start:
                    newchars, decodedbytes = self.decode(data[:exc.start], self.errors)
                else:
                    raise
...
msg172372 - (view) Author: Marcus Gröber (Marcus.Gröber) Date: 2012-10-08 11:19
I came across this today as well. A short way of summarizing this error seems to be:

Reading a file using readline (or "for line in file") fails, if the following two conditions are true:

•	A codec (e.g. UTF-8) for a multi-byte encoding is used, and
•	The first line of the file is at least 73 bytes long, and contains a multi-byte-sequence that starts before offset 72, and ends after offset 72

At least for UTF-8 input files, it may be possible to work around this by opening the input file without a codec, and then applying decode("utf-8") to each line.
msg172391 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-08 16:07
This error happens due to the fact that utf16* decoders do not properly partial decode truncated data. Exception raised if input data truncated on the second surrogate in the surrogate pair. For example codecs.utf_16_le_decode(b'\x00\xd8\x00') should return ('', 0), but raises UnicodeDecodeError.
msg172392 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-08 16:10
Here are the patches.
msg172527 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-10-09 21:11
This issue may be related or a duplicate of #11461.

> For example codecs.utf_16_le_decode(b'\x00\xd8\x00') should return ('', 0), but raises UnicodeDecodeError.

Only incremental decoder should return partial results. Other decoders are strict and (usually) stateless.

$ ./python 
>>> import codecs
>>> decoder = codecs.getdecoder('utf8')
>>> decoder('\u20ac'.encode('utf8'), 'strict')
('€', 3)
>>> decoder('\u20ac'.encode('utf8')[:2], 'strict')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data
msg172529 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-10-09 21:17
> with codecs.open('test.txt', 'wb', 'utf-16-le') as fp:

Since Python 2.6+, you can use io.open() which uses the new io library. The io library uses TextIOWrapper which uses incremental encoder and decoder and so handles multibyte encodings correctly (as UTF-16).

Said differently, this issue is already fixed in the io library.

It remembers me that I should propose again my PEP 400 :-)
msg172530 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-10-09 21:19
> This issue may be related or a duplicate of #11461.

Hum no. The bug is an issue in the design of codecs.Stream* classes: incremental decoders and encoders should be used instead of classic decoders/encoders.

I don't want to fix this issue: it's better to move to the io library for the reasons listed in the PEP 400.
msg172532 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-09 21:34
> This issue may be related or a duplicate of #11461.

Oh, yes, it is a duplicate. I totally forgot about it and made the work again.

> Only incremental decoder should return partial results. Other decoders are
> strict and (usually) stateless.

Yes, there is a incremental decoder.

> >>> decoder('\u20ac'.encode('utf8')[:2], 'strict')
> 
> UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1:
> unexpected end of data

>>> codecs.utf_8_decode('\u20ac'.encode('utf8')[:2])
('', 0)
msg172535 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-10-09 21:39
>>> codecs.utf_8_decode('\u20ac'.encode('utf8')[:2])
('', 0)

Oh... codecs.CODEC_decode are incremental decoders? I misunderstood completly this.

"The bug is an issue in the design of codecs.Stream* classes: incremental decoders and encoders should be used instead of classic decoders/encoders."

Hum, I suppose that the issue cannot be reproduded with TextIOWrapper, just because io.TextIOWrapper and codecs.StreamReader use different buffer sizes.
msg172536 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-09 21:43
> Hum no. The bug is an issue in the design of codecs.Stream* classes: incremental decoders and encoders should be used instead of classic decoders/encoders.

I don't understand you. StreamReader and IncrementalDecoder both use the same decoder.

class IncrementalDecoder(codecs.BufferedIncrementalDecoder):
    _buffer_decode = codecs.utf_16_le_decode

class StreamReader(codecs.StreamReader):
    decode = codecs.utf_16_le_decode

> I don't want to fix this issue: it's better to move to the io library for the reasons listed in the PEP 400.

The bug in utf-16 decoder, not in codecs.StreamReader.
msg172537 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-10-09 21:46
> I don't understand you.

Read my last message, I was wrong.
msg172558 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2012-10-10 09:49
> >>> codecs.utf_8_decode('\u20ac'.encode('utf8')[:2])
> ('', 0)
>
> Oh... codecs.CODEC_decode are incremental decoders? I misunderstood completly this.

No, those function are not decoders, they're just helper functions used to implement the real incremental decoders. That's why they're undocumented.

Whether codecs.utf_8_decode() returns partial results or raises an exception depends on the final argument::

>>> s = '\u20ac'.encode('utf8')[:2]
>>> codecs.utf_8_decode(s, 'strict')
('', 0)
>>> codecs.utf_8_decode(s, 'strict', False)
('', 0)
>>> codecs.utf_8_decode(s, 'strict', True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data

If you look at encodings/utf_8.py you see that the stateless decoder call codecs.utf_8_decode() with final==True::

    def decode(input, errors='strict'):
        return codecs.utf_8_decode(input, errors, True)

so the stateless decoder *will* raise exceptions for partial results. The incremental decoder simply passed on the final argument given to its encode() method.
History
Date User Action Args
2013-01-07 17:54:49serhiy.storchakasetstatus: open -> closed
superseder: UTF-16 incremental decoder doesn't support partial surrogate pair
resolution: duplicate
stage: patch review -> resolved
2012-10-24 09:32:48serhiy.storchakasetkeywords: + needs review
stage: patch review
2012-10-10 09:49:29doerwaltersetnosy: + doerwalter
messages: + msg172558
2012-10-09 21:46:17vstinnersetmessages: + msg172537
2012-10-09 21:43:56serhiy.storchakasetmessages: + msg172536
2012-10-09 21:39:20vstinnersetmessages: + msg172535
2012-10-09 21:34:22serhiy.storchakasetmessages: + msg172532
2012-10-09 21:19:47vstinnersetmessages: + msg172530
2012-10-09 21:17:31vstinnersetmessages: + msg172529
2012-10-09 21:11:47vstinnersetmessages: + msg172527
2012-10-08 21:15:22pitrousetnosy: + vstinner
2012-10-08 16:10:09serhiy.storchakasetfiles: + utf16_partial_decode-3.3.patch, utf16_partial_decode-3.2.patch, utf16_partial_decode-2.7.patch
keywords: + patch
messages: + msg172392
2012-10-08 16:07:16serhiy.storchakasetversions: + Python 3.3, Python 3.4, - Python 2.6, Python 3.1
nosy: + ezio.melotti, serhiy.storchaka

messages: + msg172391

components: + Unicode, - Library (Lib)
2012-10-08 11:19:25Marcus.Gröbersetnosy: + Marcus.Gröber
messages: + msg172372
2012-07-07 15:18:38lovelylaincreate