
Author jgsack
Recipients Rhamphoryncus, ggenellina, gvanrossum, jgsack
Date 2007-11-01.19:56:29
Message-id <472A2F3C.7020205@san.rr.com>
In-reply-to <1193944058.05.0.212980178301.issue1328@psf.upfronthosting.co.za>
Content
Adam Olsen wrote:
> Adam Olsen added the comment:
> 
> The problem with "being tolerant" as you suggest is you lose the ability
> to round-trip.  Read in a file using the UTF-8 signature, write it back
> out, and suddenly nothing else can open it.

I'm sorry, I don't see the round-trip problem you describe.

If codec utf_8 or utf_8_sig were to accept input with or without the
3-byte BOM, and write it as currently specified without/with the BOM
respectively, then _I_ can reread again with either utf_8 or utf_8_sig.

No round trip problem _for me_.
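
A minimal sketch of the round trip described above, using Python's standard codecs. It relies on the documented behavior of current Python versions, where the utf-8-sig decoder skips an optional leading BOM and the plain utf-8 decoder does not. The helper name `read_tolerant` and the sample text are assumptions made for illustration, not part of any proposal in this thread.

```python
import codecs

text = "dusty archive"

# utf-8 writes no signature; utf-8-sig prepends the 3-byte BOM (EF BB BF).
plain = text.encode("utf-8")
signed = text.encode("utf-8-sig")
assert signed == codecs.BOM_UTF8 + plain


def read_tolerant(data: bytes) -> str:
    """Hypothetical helper: decode UTF-8 input with or without a leading BOM."""
    return data.decode("utf-8-sig")


# Either form reads back to the same text through the tolerant reader.
round_trip_plain = read_tolerant(plain)
round_trip_signed = read_tolerant(signed)

# The strict utf-8 decoder keeps the BOM as a leading U+FEFF character.
strict_signed = signed.decode("utf-8")
```

The last line shows the other side of the argument: a reader that uses plain utf-8 sees an extra U+FEFF at the start of a signed file, which is why the format a recipient expects still matters.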

Now, if I need to exchange files with someone else, that's a different
matter. One way or another I need to know what format they need and
create the output they require for their input.

Am I missing something in your statement of a problem?

> Conceptually, these signatures shouldn't even be part of the encoding;
> they're a prefix in the file indicating which encoding to use.

Yes, I'm aware of that, but you can't predict what you may find in dusty
archives, or what someone may give to you. IMO, that's the basis of
being tolerant in what you accept, is it not?

> Note that the BOM signature (ZWNBSP) is a valid code point.  Although it
> seems unlikely for a file to start with ZWNBSP, if you were to chop a file
> up into smaller chunks and decode them individually you'd be more likely
> to run into it.  (However, it seems general use of ZWNBSP is being
> discouraged precisely due to this potential for confusion[1]).

I understand that throwing away a ZWNBSP at the beginning of a file does
risk discarding data rather than metadata. I also believe the standards
people recognized that and deliberately picked a BOM character that is a
calculated low risk. I'm willing to accept that risk.
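
The risk being weighed here can be sketched as follows, assuming a file is split at an arbitrary byte offset and each chunk is decoded on its own. The split point and sample string are hypothetical, chosen so that a chunk happens to begin with the encoded U+FEFF.

```python
ZWNBSP = "\ufeff"

# Text with a ZWNBSP in the middle (as data, not as a signature).
original = "ab" + ZWNBSP + "cd"
data = original.encode("utf-8")

# Hypothetical split that falls right before the three U+FEFF bytes.
chunks = [data[:2], data[2:]]

# A BOM-tolerant decoder treats the chunk's leading U+FEFF as a signature
# and drops it; the strict decoder preserves it as a character.
tolerant = "".join(chunk.decode("utf-8-sig") for chunk in chunks)
strict = "".join(chunk.decode("utf-8") for chunk in chunks)
```

With the tolerant decoder the ZWNBSP is lost from the reassembled text, while the strict decoder reproduces the original. When a whole file is decoded in one pass, only a ZWNBSP at the very start is affected.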

> In summary, guessing the encoding should never be the default.  Although
> it may be appropriate in some contexts, we must ensure we emit the right
> encoding for those contexts as well. [2]
> 
> [1] http://unicode.org/faq/utf_bom.html#38
> [2] http://unicode.org/faq/utf_bom.html#28

From my point of view, I don't see that being tolerant in what _I_ (or
my applications) accept violates any guidelines.

Please explain where I am wrong.

Regards,
..jim