Message 120892 - Python tracker

Author	warner
Recipients	pitrou, vstinner, warner
Date	2010-11-09.19:33:38
Message-id	<1289331220.96.0.543093969258.issue10370@psf.upfronthosting.co.za>
In-reply-to

Content

> Using .readline() to locate an invalid byte is not the right algorithm. If
> you would like to do that, you should open the file in binary mode and
> decode the content yourself, chunk by chunk. Or if you manipulate small
> files, you can use .read() as you wrote.

Oh, I agree that readline() is inappropriate as a validation tool. My
specific complaint is that the error message is misleading. I hit a message
like this:

 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 26: ordinal not in range(128)

and wanted to know if the file was UTF-8, or latin-1, or some other encoding,
so I wanted to see that 0xe2 in context. The message said to look at offset
26, but the actual problem might be at 4122, or 8218, etc. It took me several
minutes (and hexdump and grepping for ' e2 ') to find the character and
figure out what was going on.
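
For reference, the hexdump-and-grep search above could be replaced by something like the following. This is only a rough sketch of the "binary mode, chunk by chunk" approach from the quoted reply. The function name find_first_bad_byte and its chunk_size and context parameters are made up for illustration and are not part of any Python API. It relies on the exception's object and start attributes to turn the chunk-relative position into an absolute file offset.

```python
import codecs


def find_first_bad_byte(path, encoding="ascii", chunk_size=4096, context=16):
    """Return (absolute_offset, surrounding_bytes) for the first byte that
    fails to decode, or None if the whole file decodes cleanly.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    total_read = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            total_read += len(chunk)
            try:
                # An empty chunk means end of file, so flush the decoder.
                decoder.decode(chunk, final=not chunk)
            except UnicodeDecodeError as exc:
                # exc.object is the data the codec was given; it ends at the
                # last byte read so far, and exc.start is relative to it.
                offset = total_read - len(exc.object) + exc.start
                start = max(0, offset - context)
                f.seek(start)
                return offset, f.read(offset - start + context + 1)
            if not chunk:
                return None
```

Calling find_first_bad_byte("some_file.txt") would then give the real file offset plus a few bytes on either side, which can be printed with repr() or decoded with candidate encodings such as UTF-8 or latin-1 to see which one produces sensible text.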

Perhaps, if the error message cannot report a correct offset, then it
shouldn't be reporting an offset at all.

History
Date	User	Action	Args
2010-11-09 19:33:41	warner	set	recipients: + warner, pitrou, vstinner
2010-11-09 19:33:40	warner	set	messageid: <1289331220.96.0.543093969258.issue10370@psf.upfronthosting.co.za>
2010-11-09 19:33:38	warner	link	issue10370 messages
2010-11-09 19:33:38	warner	create