Message 164869 - Python tracker


Author	lovelylain
Recipients	lovelylain
Date	2012-07-07.15:18:38
Message-id	<1341674319.29.0.269195397783.issue15278@psf.upfronthosting.co.za>

Content

In the following example (Python 2), `for line in fp` raises UnicodeDecodeError, although `fp.read()` decodes the same file without error:
# -*- coding: utf-8 -*-
import codecs

text = u'\u6731' + u'\U0002a6a5' * 18
print repr(text)

with codecs.open('test.txt', 'wb', 'utf-16-le') as fp:
    fp.write(text)

with codecs.open('test.txt', 'rb', 'utf-16-le') as fp:
    print repr(fp.read())

with codecs.open('test.txt', 'rb', 'utf-16-le') as fp:
    for line in fp:
        print repr(line)
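
A quick look at the byte layout may help explain the failure. The sketch below uses Python 3 syntax and assumes that readline() asks the stream for 72 bytes at a time (the `readsize = size or 72` default in codecs.py). Under that assumption, the first chunk ends in the middle of the last surrogate pair:

```python
# Sketch (Python 3): inspect the bytes that the example writes to test.txt.
text = "\u6731" + "\U0002a6a5" * 18
raw = text.encode("utf-16-le")

# U+6731 takes 2 bytes; each U+2A6A5 is a surrogate pair and takes 4 bytes.
print(len(raw))  # 2 + 18 * 4 = 74

# Assumption: the first chunk requested by readline() is 72 bytes long.
first_chunk = raw[:72]

try:
    first_chunk.decode("utf-16-le")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The chunk ends right after the high surrogate of the last character, so a
# strict decode of the chunk alone fails; error.start is that surrogate's offset.
print(error.reason, error.start)
```

Whether the stream reader hits exactly this condition depends on how its decode function treats incomplete input, so this only shows where the chunk boundary falls.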

I read the code in codecs.py:
    def read(self, size=-1, chars=-1, firstline=False):

        """ Decodes data from the stream self.stream and returns the
            resulting object.
...
            If firstline is true, and a UnicodeDecodeError happens
            after the first line terminator in the input only the first line
            will be returned, the rest of the input will be kept until the
            next call to read().

        """
...
            try:
                newchars, decodedbytes = self.decode(data, self.errors)
            except UnicodeDecodeError, exc:
                if firstline:
                    newchars, decodedbytes = self.decode(data[:exc.start], self.errors)
                    lines = newchars.splitlines(True)
                    if len(lines)<=1:
                        raise
                else:
                    raise
...
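
The effect of the `firstline` branch can be tried outside the stream class. The helper below is a hypothetical, simplified stand-in for the except clause quoted above (Python 3 syntax, with `bytes.decode` in place of `self.decode`); it is not the library code:

```python
def decode_firstline(data, encoding="utf-16-le"):
    """Hypothetical model of the firstline branch quoted above."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        # Decode only the bytes before the failure position.
        newchars = data[:exc.start].decode(encoding)
        lines = newchars.splitlines(True)
        if len(lines) <= 1:
            # No complete line precedes the error, so the error propagates.
            raise
        return newchars


# First half of the surrogate pair for U+2A6A5, i.e. an incomplete character.
high = "\U0002a6a5".encode("utf-16-le")[:2]

# A line terminator precedes the bad bytes: the decoded prefix is kept.
with_eol = "ab\ncd".encode("utf-16-le") + high
print(repr(decode_firstline(with_eol)))

# No line terminator precedes the bad bytes: UnicodeDecodeError is re-raised.
without_eol = ("\u6731" + "\U0002a6a5" * 17).encode("utf-16-le") + high
try:
    decode_firstline(without_eol)
    reraised = False
except UnicodeDecodeError:
    reraised = True
print(reraised)
```

The second input has the same shape as the first 72 bytes of test.txt in the example: valid characters, no line terminator, then an incomplete surrogate pair.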

The behavior of the firstline argument does not seem consistent with its docstring.
I don't know why this argument was added or why the number of lines is checked.
If it was added so that readline could recover from decode errors, it does not help when the data read so far contains no line terminator: len(lines) is then at most 1 and the UnicodeDecodeError is re-raised anyway.
Maybe read() should be changed along the following lines to support readline in codecs:

    def read(self, size=-1, chars=-1, autotruncate=False):
...
            try:
                newchars, decodedbytes = self.decode(data, self.errors)
            except UnicodeDecodeError, exc:
                if autotruncate and exc.start:
                    newchars, decodedbytes = self.decode(data[:exc.start], self.errors)
                else:
                    raise
...
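
The proposed behavior can be sketched in the same way. The two functions below are hypothetical illustrations in Python 3 syntax, assuming that undecoded bytes are carried over to the next call (as read() does with self.bytebuffer) and that data arrives in 72-byte chunks:

```python
def decode_autotruncate(data, encoding="utf-16-le"):
    """Hypothetical model of the proposed autotruncate branch.

    Returns (decoded text, number of bytes consumed).
    """
    try:
        return data.decode(encoding), len(data)
    except UnicodeDecodeError as exc:
        if exc.start:
            # Keep everything before the failure; the rest waits for more data.
            return data[:exc.start].decode(encoding), exc.start
        raise


def read_all(raw, chunk_size=72, encoding="utf-16-le"):
    """Decode raw in fixed-size chunks, carrying undecoded bytes forward."""
    pending = b""
    pieces = []
    for offset in range(0, len(raw), chunk_size):
        data = pending + raw[offset:offset + chunk_size]
        chars, consumed = decode_autotruncate(data, encoding)
        pieces.append(chars)
        pending = data[consumed:]
    if pending:
        # Bytes still undecoded at the end of input are a genuine error.
        raise ValueError("undecodable trailing bytes: %r" % pending)
    return "".join(pieces)


text = "\u6731" + "\U0002a6a5" * 18
raw = text.encode("utf-16-le")
print(read_all(raw) == text)
```

In this model the `exc.start` check defers a genuine mid-stream error rather than hiding it: the next call begins at the bad bytes, exc.start is 0, and the exception propagates.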
