Author vstinner
Recipients pitrou, vstinner, warner
Date 2010-11-09.01:44:04
Message-id <1289267047.62.0.152857515884.issue10370@psf.upfronthosting.co.za>
In-reply-to
Content
The error occurs in .readline(), which fills a buffer by reading the file chunk by chunk. Each chunk is decoded by the stateful decoder as soon as it is read. The problem is that the decoder doesn't know the file offset. Even if it did, the start and end attributes of UnicodeDecodeError are indexes into the bytes object being decoded (here, the chunk), not offsets into the file.

> but reports an error at offset 4096 (reported as "0")

4096 is the buffer_size attribute of the BufferedReader. The call chain is .readline() -> ._read_chunk() -> .buffer.read1().
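
To make the "offset 4096 reported as 0" effect concrete, here is a minimal sketch that decodes a byte string chunk by chunk with an incremental decoder. It is not the io module's actual code path: the helper name chunk_relative_error and its chunk_size parameter are hypothetical, and 4096 is used only to mirror the figure above.

```python
import codecs

def chunk_relative_error(data, chunk_size=4096, encoding="utf-8"):
    """Decode data chunk by chunk (hypothetical helper).

    Returns (chunk_start, err.start) for the first decoding error,
    or None if everything decodes cleanly.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk_start in range(0, len(data), chunk_size):
        chunk = data[chunk_start:chunk_start + chunk_size]
        try:
            decoder.decode(chunk)
        except UnicodeDecodeError as err:
            # err.start indexes into the bytes passed to the decoder,
            # not into the original data.
            return chunk_start, err.start
    return None

# 4096 valid bytes followed by one invalid byte.
data = b"a" * 4096 + b"\xff" + b"b" * 10
print(chunk_relative_error(data))  # (4096, 0)
```

The invalid byte sits at offset 4096 in the data, but the exception only sees the second chunk, so its start attribute is 0.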

> The misreported offset does not occur with read(), just with readlines().

.read() is special: it reads the whole file at once and decodes the binary content in a single pass, so the error index is relative to the start of the file.
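
For comparison, a small sketch of the single-pass case, using bytes.decode() on an in-memory byte string as a stand-in for what .read() does with the whole file content (the variable names are illustrative only):

```python
# 4096 valid bytes followed by one invalid byte.
data = b"a" * 4096 + b"\xff" + b"b" * 10

whole_offset = None
try:
    data.decode("utf-8")
except UnicodeDecodeError as err:
    # The whole content was decoded in one call, so err.start is
    # also the offset of the invalid byte in the data.
    whole_offset = err.start

print(whole_offset)  # 4096
```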

--

I don't consider this a bug, so I'm closing it as invalid.

--

Using .readline() to locate an invalid byte is not the right algorithm. If you want to do that, you should open the file in binary mode and decode the content yourself, chunk by chunk. Or, if you only handle small files, you can use .read() as you wrote.
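
One possible way to do the chunk-by-chunk approach is sketched below. The function name find_invalid_byte and its parameters are hypothetical. It also assumes the incremental decoder's getstate() returns its pending undecoded bytes as the first item, which holds for CPython's UTF-8 decoder but is worth checking for other codecs.

```python
import codecs
import io

def find_invalid_byte(stream, encoding="utf-8", chunk_size=4096):
    """Return the absolute offset of the first undecodable byte in a
    binary stream, or None if the content decodes cleanly.
    (Hypothetical helper, not part of the io module.)
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    consumed = 0  # number of bytes handed to the decoder so far
    while True:
        chunk = stream.read(chunk_size)
        final = not chunk  # an empty read means end of stream
        # Bytes of an incomplete sequence kept from the previous chunk.
        pending = len(decoder.getstate()[0])
        try:
            decoder.decode(chunk, final=final)
        except UnicodeDecodeError as err:
            # err.start is relative to (pending bytes + chunk).
            return consumed - pending + err.start
        consumed += len(chunk)
        if final:
            return None

# Usage with an in-memory stream; a real file would be opened with
# open(path, "rb").
stream = io.BytesIO(b"a" * 4096 + b"\xff" + b"b" * 10)
print(find_invalid_byte(stream))  # 4096
```

Tracking the pending bytes matters when a multi-byte sequence is split across two chunks: the decoder reports the error relative to the start of the buffered bytes, not the start of the new chunk.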

History
Date                 User      Action  Args
2010-11-09 01:44:07  vstinner  set     recipients: + vstinner, warner, pitrou
2010-11-09 01:44:07  vstinner  set     messageid: <1289267047.62.0.152857515884.issue10370@psf.upfronthosting.co.za>
2010-11-09 01:44:05  vstinner  link    issue10370 messages
2010-11-09 01:44:04  vstinner  create