
Author Rhamphoryncus
Recipients Rhamphoryncus, ggenellina, gvanrossum, jgsack
Date 2007-11-01.19:07:37
Message-id <1193944058.05.0.212980178301.issue1328@psf.upfronthosting.co.za>
In-reply-to
Content
The problem with "being tolerant", as you suggest, is that you lose the
ability to round-trip.  Read in a file that starts with the UTF-8 signature,
write it back out, and suddenly nothing else can open it.
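
To illustrate the round-trip concern, here is a minimal sketch. It assumes
the "tolerant" reader behaves like Python's existing utf-8-sig codec
(silently strip the signature if present) and that the writer uses plain
utf-8; the sample content is made up.

```python
import codecs

# A file's bytes as written by a tool that prepends the UTF-8 signature.
original = codecs.BOM_UTF8 + "hello".encode("utf-8")

# Tolerant read: utf-8-sig drops the signature if it is present.
text = original.decode("utf-8-sig")

# Write back with plain UTF-8: the signature is not restored.
rewritten = text.encode("utf-8")

print(rewritten == original)                  # False
print(rewritten.startswith(codecs.BOM_UTF8))  # False
```

Any program that relied on the signature to identify the encoding no longer
finds it in the rewritten bytes.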

Conceptually, these signatures shouldn't even be part of the encoding;
they're a prefix in the file indicating which encoding to use.

Note that the BOM signature (ZWNBSP, U+FEFF) is a valid code point.  Although
it seems unlikely for a file to start with ZWNBSP, if you were to chop a file
up into smaller chunks and decode them individually you'd be more likely
to run into it.  (However, it seems general use of ZWNBSP is being
discouraged precisely due to this potential for confusion [1].)
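
The chunking case can be sketched as follows. The text and the split point
are hypothetical, chosen so that the second chunk happens to begin with the
bytes of ZWNBSP; utf-8-sig again stands in for a signature-stripping
decoder.

```python
# Text with a ZWNBSP (U+FEFF) in the middle, not at the start.
text = "ab\ufeffcd"
data = text.encode("utf-8")

# Split so the second chunk begins with EF BB BF, the bytes of U+FEFF.
chunks = [data[:2], data[2:]]

# Decoding each chunk on its own with a signature-stripping codec
# treats the ZWNBSP as a signature and discards it.
stripped = "".join(chunk.decode("utf-8-sig") for chunk in chunks)

# Plain UTF-8 keeps the code point.
plain = "".join(chunk.decode("utf-8") for chunk in chunks)

print(stripped == text)  # False
print(plain == text)     # True
```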

In summary, guessing the encoding should never be the default.  Although
it may be appropriate in some contexts, we must ensure we emit the right
encoding for those contexts as well. [2]

[1] http://unicode.org/faq/utf_bom.html#38
[2] http://unicode.org/faq/utf_bom.html#28