classification
Title: py3 readlines() reports wrong offset for UnicodeDecodeError
Type: behavior
Stage:
Components: IO
Versions: Python 3.1, Python 3.2
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: eric.araujo, ezio.melotti, pitrou, vstinner, warner
Priority: normal
Keywords:

Created on 2010-11-09 01:05 by warner, last changed 2010-11-21 03:30 by eric.araujo. This issue is now closed.

Files
File name Uploaded Description
test.py warner, 2010-11-09 01:05 test case
Messages (3)
msg120830 Author: Brian Warner (warner) Date: 2010-11-09 01:05
I noticed that the UnicodeDecodeError raised by open(fn).readlines() (i.e. using the default ASCII encoding) on a file that is actually UTF-8 reports the wrong offset for the first undecodable character. From what I can tell, it reports (offset % 4096) instead of the actual offset.

I've attached a test case. It emits "all good" when run against py2.x (well, after converting the print() expressions back into statements), but reports an error at offset 4096 (reported as "0") on py3.1.2 and py3.2a3. I'm running on a Debian (sid) x86 box.

The misreported offset does not occur with read(), just with readlines().
msg120832 Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-11-09 01:44
The error occurs in .readline(): .readline() fills a buffer by reading the file chunk by chunk. Each time a chunk is read, it is decoded by the stateful decoder. The problem is that the decoder doesn't know the file offset. Even if it did, the start and end attributes of UnicodeDecodeError are indexes into the bytes object being decoded, not into the file.

> but reports an error at offset 4096 (reported as "0")

4096 is the buffer_size attribute of BufferedReader: .readline() -> ._read_chunk() -> .buffer.read1().
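A minimal sketch of why the reported position is relative to the chunk, assuming each 4096-byte chunk is decoded on its own. The sample data, CHUNK_SIZE and first_error_position() are illustrative names, not taken from the attached test.py or from the io module:

```python
# Illustrative data: 4122 ASCII bytes followed by one UTF-8 encoded character.
data = b"a" * 4122 + "\u2019".encode("utf-8")  # the character is b"\xe2\x80\x99"
CHUNK_SIZE = 4096  # assumed chunk size, matching the buffer size discussed here


def first_error_position(data, chunk_size):
    """Decode chunk by chunk and return exc.start of the first failure."""
    for begin in range(0, len(data), chunk_size):
        try:
            data[begin:begin + chunk_size].decode("ascii")
        except UnicodeDecodeError as exc:
            # exc.start indexes the slice that was decoded, not the whole data.
            return exc.start
    return None


chunked = first_error_position(data, CHUNK_SIZE)  # decode in 4096-byte pieces
whole = first_error_position(data, len(data))     # decode everything at once
print(chunked, whole)  # 26 4122
```

Decoding in pieces reports 26 (that is, 4122 % 4096), while decoding in one pass reports the absolute offset 4122.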

> The misreported offset does not occur with read(), just with readlines().

.read() is special: it reads the whole file at once and decodes the binary content in a single pass.

--

I don't consider this a bug, so I'm closing the issue as invalid.

--

Using .readline() to locate an invalid byte is not the right algorithm. If you would like to do that, you should open the file in binary mode and decode the content yourself, chunk by chunk. Or, if you only handle small files, you can use .read() as you wrote.
msg120892 Author: Brian Warner (warner) Date: 2010-11-09 19:33
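The binary-mode approach could look roughly like the sketch below. It assumes an incremental decoder from the codecs module and tracks the absolute offset by hand. The function name find_first_undecodable() and its parameters are hypothetical, chosen only for illustration:

```python
import codecs
import io


def find_first_undecodable(stream, encoding="ascii", chunk_size=4096):
    """Return the absolute byte offset of the first undecodable byte, or None.

    `stream` must be a binary file object (opened with mode "rb").
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    consumed = 0  # bytes handed to the decoder in earlier iterations
    while True:
        chunk = stream.read(chunk_size)
        final = not chunk  # an empty read means end of file
        # Bytes the decoder kept from the previous chunk (for example, an
        # incomplete multi-byte sequence); exc.start counts them too.
        pending = len(decoder.getstate()[0])
        try:
            decoder.decode(chunk, final)
        except UnicodeDecodeError as exc:
            return consumed - pending + exc.start
        if final:
            return None
        consumed += len(chunk)


# Usage with in-memory data; a real file would be opened with open(fn, "rb").
sample = b"a" * 4122 + "\u2019".encode("utf-8") + b"b" * 10
print(find_first_undecodable(io.BytesIO(sample), "ascii"))  # 4122
print(find_first_undecodable(io.BytesIO(sample), "utf-8"))  # None
```

Because the offset is accumulated outside the decoder, the result does not depend on the chunk size.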
> Using .readline() to locate an invalid byte is not the right algorithm. If
> you would like to do that, you should open the file in binary mode and
> decode the content yourself, chunk by chunk. Or, if you only handle small
> files, you can use .read() as you wrote.

Oh, I agree that readline() is inappropriate as a validation tool. My
specific complaint is that the error message is misleading. I hit a message
like this:

 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 26: ordinal not in range(128)

and wanted to know if the file was UTF-8, or latin-1, or some other encoding,
so I wanted to see that 0xe2 in context. The message said to look at offset
26, but the actual problem might be at 4122, or 8218, etc. It took me several
minutes (and hexdump and grepping for ' e2 ') to find the character and
figure out what was going on.
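Once the real offset is known, viewing the byte in context needs only a few lines. This is a minimal sketch; byte_context() and its radius parameter are hypothetical names, and the sample bytes are made up for illustration:

```python
def byte_context(data, offset, radius=8):
    """Return (window_start, hex_dump) for the bytes around `offset`."""
    start = max(0, offset - radius)
    window = data[start:offset + radius + 1]
    return start, " ".join(f"{byte:02x}" for byte in window)


# Made-up sample: "don't" written with a UTF-8 right single quotation mark.
sample = b"don" + "\u2019".encode("utf-8") + b"t"
start, dump = byte_context(sample, 3, radius=2)
print(start, dump)  # 1 6f 6e e2 80 99
```

Seeing e2 followed by 80 99 suggests a three-byte UTF-8 sequence rather than a lone Latin-1 character.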

Perhaps, if the error message cannot report a correct offset, then it
shouldn't be reporting an offset at all.
History
Date User Action Args
2010-11-21 03:30:24 eric.araujo set nosy: + eric.araujo
2010-11-09 19:34:40 ezio.melotti set nosy: + ezio.melotti
2010-11-09 19:33:38 warner set messages: + msg120892
2010-11-09 01:44:06 vstinner set status: open -> closed; nosy: + vstinner; messages: + msg120832; resolution: not a bug
2010-11-09 01:06:55 r.david.murray set nosy: + pitrou
2010-11-09 01:05:14 warner create