Message 57028 - Python tracker


Author	Rhamphoryncus
Recipients	Rhamphoryncus, ggenellina, gvanrossum, jgsack
Date	2007-11-01.19:07:37
Message-id	<1193944058.05.0.212980178301.issue1328@psf.upfronthosting.co.za>
In-reply-to

Content

The problem with "being tolerant" as you suggest is that you lose the
ability to round-trip.  Read in a file that uses the UTF-8 signature,
write it back out, and suddenly nothing else can open it.
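To make the round-trip problem concrete, here is a minimal sketch in Python. It assumes the "tolerant" reader behaves like the standard "utf-8-sig" codec and that the writer uses plain "utf-8"; the sample bytes are invented for illustration.

```python
import codecs

# A file's bytes as they might sit on disk: UTF-8 signature, then the text.
original = codecs.BOM_UTF8 + "héllo\n".encode("utf-8")

# A tolerant reader: "utf-8-sig" silently drops a leading signature.
text = original.decode("utf-8-sig")

# Writing back with plain "utf-8" does not restore the signature.
rewritten = text.encode("utf-8")

print(original.startswith(codecs.BOM_UTF8))   # True
print(rewritten.startswith(codecs.BOM_UTF8))  # False
print(rewritten == original)                  # False
```

Any consumer that relied on the signature to pick the encoding no longer has it after the rewrite.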

Conceptually, these signatures shouldn't even be part of the encoding;
they're a prefix in the file indicating which encoding to use.
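One way to picture "a prefix in the file" is to split the signature off before decoding and put it back when writing. The helper below is a hypothetical sketch, not an existing API; the name split_signature and the table _SIGNATURES are made up here, and only the BOM constants come from the standard codecs module.

```python
import codecs

# Longer signatures first: the UTF-32-LE signature begins with the
# same two bytes as the UTF-16-LE one.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def split_signature(data):
    """Return (encoding, signature, payload); encoding is None if no signature."""
    for bom, encoding in _SIGNATURES:
        if data.startswith(bom):
            return encoding, bom, data[len(bom):]
    return None, b"", data

original = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, signature, payload = split_signature(original)
text = payload.decode(encoding)

# Re-attaching the same prefix on output preserves the round-trip.
rewritten = signature + text.encode(encoding)
```

Because the signature is kept separately from the decoded text, the caller decides explicitly whether to emit it again. Note that this sketch still cannot distinguish a UTF-32-LE signature from a UTF-16-LE signature followed by U+0000.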

Note that the BOM signature (ZWNBSP) is a valid code point.  Although it
seems unlikely for a file to start with ZWNBSP, if you were to chop a
file up into smaller chunks and decode them individually, you'd be more
likely to run into it.  (However, it seems general use of ZWNBSP is
being discouraged precisely due to this potential for confusion [1].)
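The chunking case can be sketched as follows. The example assumes UTF-8 data and a chunk boundary that happens to fall right before a ZWNBSP (U+FEFF) in the middle of the text; the sample string is invented for illustration.

```python
# Text with a ZWNBSP (U+FEFF) in the middle, not at the start.
text = "abc\ufeffdef"
data = text.encode("utf-8")

# Split so that the second chunk begins with the bytes EF BB BF.
chunks = [data[:3], data[3:]]

# A signature-tolerant decoder treats that ZWNBSP as a BOM and drops it.
tolerant = "".join(chunk.decode("utf-8-sig") for chunk in chunks)

# A plain decoder keeps every code point.
strict = "".join(chunk.decode("utf-8") for chunk in chunks)

print(tolerant == text)  # False
print(strict == text)    # True
```

The tolerant decoder loses a character that was part of the content, which is the confusion described above.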

In summary, guessing the encoding should never be the default.  Although
it may be appropriate in some contexts, we must ensure we emit the right
encoding for those contexts as well. [2]

[1] http://unicode.org/faq/utf_bom.html#38
[2] http://unicode.org/faq/utf_bom.html#28
