Message 120832 - Python tracker

Author	vstinner
Recipients	pitrou, vstinner, warner
Date	2010-11-09.01:44:04
Message-id	<1289267047.62.0.152857515884.issue10370@psf.upfronthosting.co.za>
In-reply-to

Content
The error occurs in .readline(): .readline() fills a buffer by reading the file chunk by chunk. Each time a chunk is read, it is decoded by the stateful decoder. The problem is that the decoder doesn't know the file offset. Even if it did, the start and end attributes of UnicodeDecodeError are indexes into the bytes object being decoded, not offsets in the file.
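To illustrate the last point, here is a small sketch (the byte strings are made up for the example) showing that the start and end attributes are relative to whatever bytes object was passed to the decoder. The same invalid byte is reported at a different index depending on where the decoded slice begins:

```python
# Sample data invented for illustration: 0xff is never valid in UTF-8.
data = b"abc\xffdef"

# Decoding the whole object: the index matches the position in `data`.
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    whole = (exc.start, exc.end)

# Decoding a slice that starts at the bad byte: the index is relative
# to the slice, so the same byte is now reported at index 0.
chunk = data[3:]
try:
    chunk.decode("utf-8")
except UnicodeDecodeError as exc:
    sliced = (exc.start, exc.end)

print(whole, sliced)
```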

> but reports an error at offset 4096 (reported as "0")

4096 is the buffer_size attribute of BufferedReader: .readline() -> ._read_chunk() -> .buffer.read1().

> The misreported offset does not occur with read(), just with readlines().

.read() is special: it reads the whole file at once and decodes the binary content in a single call, so the reported index is the same as the file offset.

--

I don't consider this a bug, so I'm closing the issue as invalid.

--

Using .readline() to locate an invalid byte is not the right approach. If you want to do that, you should open the file in binary mode and decode the content yourself, chunk by chunk. Or, if you are working with small files, you can use .read() as you wrote.
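A minimal sketch of the chunk-by-chunk approach might look like the following. The function name find_invalid_byte and its parameters are hypothetical, and the offset arithmetic assumes a codec such as UTF-8 whose incremental decoder keeps undecoded trailing bytes in the first element of getstate(); other codecs may need different bookkeeping.

```python
import codecs


def find_invalid_byte(path, encoding="utf-8", chunk_size=4096):
    """Return the absolute file offset of the first undecodable byte,
    or None if the whole file decodes cleanly. (Hypothetical helper.)"""
    decoder = codecs.getincrementaldecoder(encoding)()
    consumed = 0  # bytes read from the file before the current chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            final = not chunk  # an empty read means end of file
            # Bytes of an incomplete sequence held over from earlier chunks;
            # the decoder prepends them to `chunk` before decoding.
            pending = len(decoder.getstate()[0])
            try:
                decoder.decode(chunk, final)
            except UnicodeDecodeError as exc:
                # exc.start indexes into (pending bytes + chunk), so shift
                # it back to a file offset.
                return consumed - pending + exc.start
            if final:
                return None
            consumed += len(chunk)
```

Because the caller tracks how many bytes have been read, the relative index from UnicodeDecodeError can be converted into a file offset, which is exactly the information the text-mode .readline() path does not have.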

History
Date	User	Action	Args
2010-11-09 01:44:07	vstinner	set	recipients: + vstinner, warner, pitrou
2010-11-09 01:44:07	vstinner	set	messageid: <1289267047.62.0.152857515884.issue10370@psf.upfronthosting.co.za>
2010-11-09 01:44:05	vstinner	link	issue10370 messages
2010-11-09 01:44:04	vstinner	create