Message 125940 - Python tracker


Author	vstinner
Recipients	cito, eric.araujo, loewis, rhettinger, vstinner
Date	2011-01-10.22:18:38
SpamBayes Score	4.2621114e-08
Marked as misclassified	No
Message-id	<1294697919.5.0.787147009984.issue1697943@psf.upfronthosting.co.za>
In-reply-to

Content

Extract of the Unicode standard: "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature".

See also the following section explaining issues with the UTF-8 BOM:
http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

I agree that Python should handle the (UTF-8) BOM when reading a CSV file (#7185), because the file format is common on Windows.
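For the CSV case, a minimal sketch of what "handling the BOM" can look like with the existing "utf-8-sig" codec. The sample bytes and the variable names are made up for illustration; this is not the patch from #7185.

```python
import csv
import io

# Hypothetical sample: a UTF-8 BOM followed by CSV data, as a Windows tool might write it.
raw = b"\xef\xbb\xbfname,city\r\nAda,London\r\n"

# "utf-8-sig" drops a leading BOM if one is present and otherwise decodes like "utf-8".
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig", newline="") as stream:
    rows = list(csv.reader(stream))

# Decoding as plain "utf-8" leaves U+FEFF attached to the first field.
plain_rows = list(csv.reader(io.StringIO(raw.decode("utf-8"), newline="")))

print(rows[0][0] == "name")             # BOM removed by utf-8-sig
print(plain_rows[0][0] == "\ufeffname")  # BOM kept by plain utf-8
```

With plain "utf-8" the first column name silently carries the invisible character, which is the kind of surprise the CSV issue is about.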

But msgfmt is a UNIX tool: I would expect Python to behave like the original msgfmt tool and fail with a fatal error on the BOM "invisible character". How do you explain to a user that msgfmt fails but msgfmt.py does not?

About the patch: *ignoring* the BOM is not a good idea. The BOM announces the encoding (e.g. UTF-8): if a Content-Type header announces another encoding, you should raise an error.
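A rough sketch of the check I have in mind, not the actual msgfmt.py code. The function name check_po_encoding and the regular expression are hypothetical, and the sketch assumes the charset appears on a single Content-Type header line of the .po data.

```python
import codecs
import re


def check_po_encoding(data: bytes):
    """Return the normalized charset of .po data, or None if nothing announces one.

    Hypothetical helper: raises ValueError when a UTF-8 BOM contradicts
    the charset announced by the Content-Type header.
    """
    has_bom = data.startswith(codecs.BOM_UTF8)

    # Assumption: the header looks like "Content-Type: text/plain; charset=NAME\n".
    match = re.search(rb"Content-Type:[^\n]*charset=([A-Za-z0-9_.-]+)", data)
    declared = None
    if match:
        # codecs.lookup() normalizes aliases ("UTF-8", "utf8") to one codec name.
        declared = codecs.lookup(match.group(1).decode("ascii")).name

    if has_bom:
        if declared is not None and declared != "utf-8":
            raise ValueError(
                "UTF-8 BOM found, but Content-Type announces %r" % declared
            )
        return "utf-8"
    return declared


header = b'msgid ""\nmsgstr ""\n"Content-Type: text/plain; charset=%s\\n"\n'
consistent = codecs.BOM_UTF8 + header % b"UTF-8"
conflicting = codecs.BOM_UTF8 + header % b"ISO-8859-1"

print(check_po_encoding(consistent))
try:
    check_po_encoding(conflicting)
except ValueError as exc:
    print("error:", exc)
```

Whether a consistent BOM should then be accepted or still rejected, as the original msgfmt does, is a separate decision; the sketch only covers the conflicting case.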

History
Date	User	Action	Args
2011-01-10 22:18:39	vstinner	set	recipients: + vstinner, loewis, rhettinger, cito, eric.araujo
2011-01-10 22:18:39	vstinner	set	messageid: <1294697919.5.0.787147009984.issue1697943@psf.upfronthosting.co.za>
2011-01-10 22:18:38	vstinner	link	issue1697943 messages
2011-01-10 22:18:38	vstinner	create