Message 57033 - Python tracker

Author	jgsack
Recipients	Rhamphoryncus, ggenellina, gvanrossum, jgsack
Date	2007-11-01.19:56:29
Message-id	<472A2F3C.7020205@san.rr.com>
In-reply-to	<1193944058.05.0.212980178301.issue1328@psf.upfronthosting.co.za>

Content

Adam Olsen wrote:
> Adam Olsen added the comment:
> 
> The problem with "being tolerant" as you suggest is you lose the ability
> to round-trip.  Read in a file using the UTF-8 signature, write it back
> out, and suddenly nothing else can open it.

I'm sorry, I don't see the round-trip problem you describe.

If codec utf_8 or utf_8_sig were to accept input with or without the
3-byte BOM, and write it as currently specified without/with the BOM
respectively, then _I_ can reread again with either utf_8 or utf_8_sig.

No round trip problem _for me_.
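
A minimal sketch of the tolerant reading described above, assuming a
hypothetical helper named tolerant_decode (it is not part of the codecs
module). It drops a leading 3-byte UTF-8 BOM if one is present, so output
written with either utf_8 or utf_8_sig can be read back the same way:

```python
import codecs


def tolerant_decode(data: bytes) -> str:
    """Hypothetical helper: decode UTF-8 with or without a leading BOM."""
    if data.startswith(codecs.BOM_UTF8):
        # Treat the 3-byte signature as metadata and skip it.
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")


text = "dusty archive"
plain = text.encode("utf-8")       # written without the BOM
signed = text.encode("utf-8-sig")  # written with the BOM

# Both forms read back to the same text with the tolerant reader.
reread_plain = tolerant_decode(plain)
reread_signed = tolerant_decode(signed)
```

This only covers my own reading side. What gets written for someone else
is still decided by whichever codec I pick for output.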

Now if I need to exchange with someone else, that's a different matter. One
way or another I need to know what format they need and create the
output they require for their input.

Am I missing something in your statement of a problem?

> Conceptually, these signatures shouldn't even be part of the encoding;
> they're a prefix in the file indicating which encoding to use.

Yes, I'm aware of that, but you can't predict what you may find in dusty
archives, or what someone may give to you. IMO, that's the basis of
being tolerant in what you accept, is it not?

> Note that the BOM signature (ZWNBSP) is a valid code point.  Although it
> seems unlikely for a file to start with ZWNBSP, if you were to chop a file
> up into smaller chunks and decode them individually you'd be more likely
> to run into it.  (However, it seems general use of ZWNBSP is being
> discouraged precisely due to this potential for confusion[1]).

I understand that throwing away a ZWNBSP at the beginning of a file does
risk discarding data rather than metadata. I also believe the standards
people recognized that and deliberately picked a BOM character that is a
calculated low risk. I'm willing to accept that risk.
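
To make that risk concrete, here is a small sketch, assuming current
Python 3 codec behaviour, where the input deliberately begins with ZWNBSP
(U+FEFF) as data. The variable names are only for illustration:

```python
# Payload whose first character is a real ZWNBSP, meant as data.
payload = "\ufeffabc".encode("utf-8")

# The plain utf-8 codec keeps the leading U+FEFF as a character.
strict = payload.decode("utf-8")

# utf-8-sig treats the same three leading bytes as a signature and drops them.
tolerant = payload.decode("utf-8-sig")

# A U+FEFF that is not at the very start is left alone by utf-8-sig.
inner = "ab\ufeffc".encode("utf-8").decode("utf-8-sig")
```

So the only data at risk is a ZWNBSP in the very first position of
whatever is handed to the decoder, which is also why decoding a file in
separate chunks makes the case more likely.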

> In summary, guessing the encoding should never be the default.  Although
> it may be appropriate in some contexts, we must ensure we emit the right
> encoding for those contexts as well. [2]
> 
> [1] http://unicode.org/faq/utf_bom.html#38
> [2] http://unicode.org/faq/utf_bom.html#28

From my point of view, I don't see that being tolerant in what _I_ (or
my applications) accept violates any guidelines.

Please explain where I am wrong.

Regards,
..jim

History
Date	User	Action	Args
2007-11-01 19:56:30	jgsack	set	spambayes_score: 0.0147614 -> 0.014761394 recipients: + jgsack, gvanrossum, ggenellina, Rhamphoryncus
2007-11-01 19:56:30	jgsack	link	issue1328 messages
2007-11-01 19:56:29	jgsack	create