Message 97341 - Python tracker



Author	vstinner
Recipients	vstinner
Date	2010-01-07.03:03:54
Message-id	<1262833437.24.0.308267564541.issue7651@psf.upfronthosting.co.za>
In-reply-to

Content

If the file starts with a BOM, open(filename) should be able to guess the charset. This would help many high-level modules:

 - #7519: ConfigParser
 - #7185: csv
 - and any module using open() to read a text file

Currently, the user has to choose between UTF-8 and UTF-8-SIG to skip the UTF-8 BOM. For UTF-16, the user has to specify UTF-16-LE or UTF-16-BE, even if the file starts with a BOM (which should be the case most of the time).
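To illustrate the UTF-8 case, here is a minimal sketch that decodes the same bytes with both codecs. The sample content is made up for the example; only the standard codecs module is used.

```python
import codecs

# Bytes of a small text file that starts with the UTF-8 BOM.
data = codecs.BOM_UTF8 + "key = value\n".encode("utf-8")

# Plain UTF-8 keeps the BOM as a leading U+FEFF character.
as_utf8 = data.decode("utf-8")

# UTF-8-SIG strips the BOM if it is present.
as_utf8_sig = data.decode("utf-8-sig")

print(repr(as_utf8))      # starts with '\ufeff'
print(repr(as_utf8_sig))  # 'key = value\n'
```

A parser that receives the first string sees an unexpected character in front of the first key, which is the kind of problem the ConfigParser and csv issues describe.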

The idea is to delay the creation of the decoder and the encoder. Just after reading the first chunk, try to guess the charset by searching for a BOM (only if the charset is unknown). If no BOM is found, fall back to the current guess code (os.device_charset() or locale.getpreferredencoding()).
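The detection step could look roughly like the sketch below. This is not the proof of concept itself: guess_encoding() is a hypothetical helper, search_bom() is only assumed to behave like the function named later in this message, and the fallback uses locale.getpreferredencoding() alone.

```python
import codecs
import locale

# Longest BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def search_bom(first_bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM matches."""
    for bom, encoding in _BOMS:
        if first_bytes.startswith(bom):
            return encoding, len(bom)
    return None, 0


def guess_encoding(first_chunk):
    """Hypothetical helper: try the BOM, then the locale's preferred encoding."""
    encoding, bom_length = search_bom(first_chunk)
    if encoding is None:
        encoding = locale.getpreferredencoding()
    return encoding, bom_length


# Usage: skip the BOM bytes, then decode the rest with the guessed codec.
chunk = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, skip = guess_encoding(chunk)
text = chunk[skip:].decode(encoding)
```

The order of the table matters. A UTF-16-LE file whose first character is U+0000 begins with the same four bytes as the UTF-32-LE BOM, so that case stays ambiguous in any BOM-based guess.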

Affected charsets: UTF-8, UTF-16-LE, UTF-16-BE, UTF-32-LE, UTF-32-BE. Binary files are not affected. If the encoding is passed explicitly to open(), the behaviour is unchanged.

I wrote a proof of concept, but there are still open issues:

 - append mode: should we seek to zero to read the BOM?
   old=tell(); seek(0); bytes=read(4); seek(old); search_bom(bytes)
 - read+write: should we guess the charset using the BOM if the first action is a write? Or should we only search for a BOM if the first action is a read?
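The append-mode question can be sketched on a raw binary file as follows. read_bom_bytes() is a hypothetical name, and the sketch assumes the file is seekable and was opened readable ("a+b"). A file opened with plain "a" cannot be read at all, which is part of why this is still an open question.

```python
import codecs
import os
import tempfile


def read_bom_bytes(raw):
    """Read up to 4 bytes from the start of a seekable binary file,
    then restore the original position."""
    old = raw.tell()
    raw.seek(0)
    try:
        return raw.read(4)
    finally:
        raw.seek(old)


# Demonstration on a temporary file opened for appending and reading.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + b"first line\n")
    with open(path, "a+b") as f:
        position_before = f.tell()
        head = read_bom_bytes(f)
        position_after = f.tell()
finally:
    os.remove(path)
```

The four bytes in head can then be passed to a BOM search, while the file position is left where the caller had it.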

History
Date	User	Action	Args
2010-01-07 03:03:57	vstinner	set	recipients: + vstinner
2010-01-07 03:03:57	vstinner	set	messageid: <1262833437.24.0.308267564541.issue7651@psf.upfronthosting.co.za>
2010-01-07 03:03:55	vstinner	link	issue7651 messages
2010-01-07 03:03:54	vstinner	create