Issue 15278: UnicodeDecodeError when readline in codecs.py

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/59483

classification

Title:	UnicodeDecodeError when readline in codecs.py
Type:	behavior	Stage:	resolved
Components:	Unicode	Versions:	Python 3.2, Python 3.3, Python 3.4, Python 2.7

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	UTF-16 incremental decoder doesn't support partial surrogate pair View: 11461
Assigned To:		Nosy List:	Marcus.Gröber, doerwalter, ezio.melotti, lovelylain, serhiy.storchaka, vstinner
Priority:	normal	Keywords:	needs review, patch

Created on 2012-07-07 15:18 by lovelylain, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
utf16_partial_decode-3.3.patch	serhiy.storchaka, 2012-10-08 16:10	Patch for 3.3 and 3.4	review
utf16_partial_decode-3.2.patch	serhiy.storchaka, 2012-10-08 16:10	Patch for 3.2	review
utf16_partial_decode-2.7.patch	serhiy.storchaka, 2012-10-08 16:10	Patch for 2.7	review

Messages (12)
msg164869 - (view)	Author: lovelylain (lovelylain)	Date: 2012-07-07 15:18
This is an example, `for line in fp` will raise UnicodeDecodeError: #! -- coding: utf-8 -- import codecs text = u'\u6731' + u'\U0002a6a5' * 18 print repr(text) with codecs.open('test.txt', 'wb', 'utf-16-le') as fp: fp.write(text) with codecs.open('test.txt', 'rb', 'utf-16-le') as fp: print repr(fp.read()) with codecs.open('test.txt', 'rb', 'utf-16-le') as fp: for line in fp: print repr(line) I read code in codecs.py: def read(self, size=-1, chars=-1, firstline=False): """ Decodes data from the stream self.stream and returns the resulting object. ... If firstline is true, and a UnicodeDecodeError happens after the first line terminator in the input only the first line will be returned, the rest of the input will be kept until the next call to read(). """ ... try: newchars, decodedbytes = self.decode(data, self.errors) except UnicodeDecodeError, exc: if firstline: newchars, decodedbytes = self.decode(data[:exc.start], self.errors) lines = newchars.splitlines(True) if len(lines)<=1: raise else: raise ... It seems that the firstline argument is not consistent with its doc description. I don't konw why this argument was added and why lines count was checked. If it was added for readline function to fix some decode errors, we may have no EOLs in data readed, so it caused UnicodeDecodeError too. Maybe we should write code like below to support codecs readline. def read(self, size=-1, chars=-1, autotruncate=False): ... try: newchars, decodedbytes = self.decode(data, self.errors) except UnicodeDecodeError, exc: if autotruncate and exc.start: newchars, decodedbytes = self.decode(data[:exc.start], self.errors) else: raise ...
msg172372 - (view)	Author: Marcus Gröber (Marcus.Gröber)	Date: 2012-10-08 11:19
I came across this today as well. A short way of summarizing this error seems to be: Reading a file using readline (or "for line in file") fails, if the following two conditions are true: • A codec (e.g. UTF-8) for a multi-byte encoding is used, and • The first line of the file is at least 73 bytes long, and contains a multi-byte-sequence that starts before offset 72, and ends after offset 72 At least for UTF-8 input files, it may be possible to work around this by opening the input file without a codec, and then applying decode("utf-8") to each line.
msg172391 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-10-08 16:07
This error happens due to the fact that utf16* decoders do not properly partial decode truncated data. Exception raised if input data truncated on the second surrogate in the surrogate pair. For example codecs.utf_16_le_decode(b'\x00\xd8\x00') should return ('', 0), but raises UnicodeDecodeError.
msg172392 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-10-08 16:10
Here are the patches.
msg172527 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-10-09 21:11
This issue may be related or a duplicate of #11461. > For example codecs.utf_16_le_decode(b'\x00\xd8\x00') should return ('', 0), but raises UnicodeDecodeError. Only incremental decoder should return partial results. Other decoders are strict and (usually) stateless. $ ./python >>> import codecs >>> decoder = codecs.getdecoder('utf8') >>> decoder('\u20ac'.encode('utf8'), 'strict') ('€', 3) >>> decoder('\u20ac'.encode('utf8')[:2], 'strict') UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data
msg172529 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-10-09 21:17
> with codecs.open('test.txt', 'wb', 'utf-16-le') as fp: Since Python 2.6+, you can use io.open() which uses the new io library. The io library uses TextIOWrapper which uses incremental encoder and decoder and so handles multibyte encodings correctly (as UTF-16). Said differently, this issue is already fixed in the io library. It remembers me that I should propose again my PEP 400 :-)
msg172530 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-10-09 21:19
> This issue may be related or a duplicate of #11461. Hum no. The bug is an issue in the design of codecs.Stream* classes: incremental decoders and encoders should be used instead of classic decoders/encoders. I don't want to fix this issue: it's better to move to the io library for the reasons listed in the PEP 400.
msg172532 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-10-09 21:34
> This issue may be related or a duplicate of #11461. Oh, yes, it is a duplicate. I totally forgot about it and made the work again. > Only incremental decoder should return partial results. Other decoders are > strict and (usually) stateless. Yes, there is a incremental decoder. > >>> decoder('\u20ac'.encode('utf8')[:2], 'strict') > > UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: > unexpected end of data >>> codecs.utf_8_decode('\u20ac'.encode('utf8')[:2]) ('', 0)
msg172535 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-10-09 21:39
>>> codecs.utf_8_decode('\u20ac'.encode('utf8')[:2]) ('', 0) Oh... codecs.CODEC_decode are incremental decoders? I misunderstood completly this. "The bug is an issue in the design of codecs.Stream* classes: incremental decoders and encoders should be used instead of classic decoders/encoders." Hum, I suppose that the issue cannot be reproduded with TextIOWrapper, just because io.TextIOWrapper and codecs.StreamReader use different buffer sizes.
msg172536 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-10-09 21:43
> Hum no. The bug is an issue in the design of codecs.Stream* classes: incremental decoders and encoders should be used instead of classic decoders/encoders. I don't understand you. StreamReader and IncrementalDecoder both use the same decoder. class IncrementalDecoder(codecs.BufferedIncrementalDecoder): _buffer_decode = codecs.utf_16_le_decode class StreamReader(codecs.StreamReader): decode = codecs.utf_16_le_decode > I don't want to fix this issue: it's better to move to the io library for the reasons listed in the PEP 400. The bug in utf-16 decoder, not in codecs.StreamReader.
msg172537 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-10-09 21:46
> I don't understand you. Read my last message, I was wrong.
msg172558 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2012-10-10 09:49
> >>> codecs.utf_8_decode('\u20ac'.encode('utf8')[:2]) > ('', 0) > > Oh... codecs.CODEC_decode are incremental decoders? I misunderstood completly this. No, those function are not decoders, they're just helper functions used to implement the real incremental decoders. That's why they're undocumented. Whether codecs.utf_8_decode() returns partial results or raises an exception depends on the final argument:: >>> s = '\u20ac'.encode('utf8')[:2] >>> codecs.utf_8_decode(s, 'strict') ('', 0) >>> codecs.utf_8_decode(s, 'strict', False) ('', 0) >>> codecs.utf_8_decode(s, 'strict', True) Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data If you look at encodings/utf_8.py you see that the stateless decoder call codecs.utf_8_decode() with final==True:: def decode(input, errors='strict'): return codecs.utf_8_decode(input, errors, True) so the stateless decoder will raise exceptions for partial results. The incremental decoder simply passed on the final argument given to its encode() method.

History
Date	User	Action	Args
2022-04-11 14:57:32	admin	set	github: 59483
2013-01-07 17:54:49	serhiy.storchaka	set	status: open -> closed superseder: UTF-16 incremental decoder doesn't support partial surrogate pair resolution: duplicate stage: patch review -> resolved
2012-10-24 09:32:48	serhiy.storchaka	set	keywords: + needs review stage: patch review
2012-10-10 09:49:29	doerwalter	set	nosy: + doerwalter messages: + msg172558
2012-10-09 21:46:17	vstinner	set	messages: + msg172537
2012-10-09 21:43:56	serhiy.storchaka	set	messages: + msg172536
2012-10-09 21:39:20	vstinner	set	messages: + msg172535
2012-10-09 21:34:22	serhiy.storchaka	set	messages: + msg172532
2012-10-09 21:19:47	vstinner	set	messages: + msg172530
2012-10-09 21:17:31	vstinner	set	messages: + msg172529
2012-10-09 21:11:47	vstinner	set	messages: + msg172527
2012-10-08 21:15:22	pitrou	set	nosy: + vstinner
2012-10-08 16:10:09	serhiy.storchaka	set	files: + utf16_partial_decode-3.3.patch, utf16_partial_decode-3.2.patch, utf16_partial_decode-2.7.patch keywords: + patch messages: + msg172392
2012-10-08 16:07:16	serhiy.storchaka	set	versions: + Python 3.3, Python 3.4, - Python 2.6, Python 3.1 nosy: + ezio.melotti, serhiy.storchaka messages: + msg172391 components: + Unicode, - Library (Lib)
2012-10-08 11:19:25	Marcus.Gröber	set	nosy: + Marcus.Gröber messages: + msg172372
2012-07-07 15:18:38	lovelylain	create