Message 57041 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	Rhamphoryncus
Recipients	Rhamphoryncus, ggenellina, gvanrossum, jgsack
Date	2007-11-01.22:21:33
SpamBayes Score	0.015951002
Marked as misclassified	No
Message-id	<aac2c7cb0711011521q617740b0sf0596ab435cbf615@mail.gmail.com>
In-reply-to	<472A2F3C.7020205@san.rr.com>

Content
On 11/1/07, James G. sack (jim) <report@bugs.python.org> wrote: > > James G. sack (jim) added the comment: > > Adam Olsen wrote: > > Adam Olsen added the comment: > > > > The problem with "being tolerate" as you suggest is you lose the ability > > to round-trip. Read in a file using the UTF-8 signature, write it back > > out, and suddenly nothing else can open it. > > I'm sorry, I don't see the round-trip problem you describe. > > If codec utf_8 or utf_8_sig were to accept input with or without the > 3-byte BOM, and write it as currently specified without/with the BOM > respectively, then _I_ can reread again with either utf_8 or utf_8_sig. > > No round trip problem _for me_. > > Now If I need to exchange with some else, that's a different matter. One > way or another I need to know what format they need and create the > output they require for their input. > > Am I missing something in your statement of a problem? You don't seem to think it's important to interact with other programs. If you're importing with no intent to write out to a common format, then yes, autodetecting the BOM is just fine. Python needs a more general default though, and not guessing is part of that. > > Conceptually, these signatures shouldn't even be part of the encoding; > > they're a prefix in the file indicating which encoding to use. > > Yes, I'm aware of that, but you can't predict what you may find in dusty > archives, or what someone may give to you. IMO, that's the basis of > being tolerant in what you accept, is it not? Garbage in, garbage out. There's a lot of protocols with whitespace, capitalization, etc that you can fudge around while retaining the same contents; character set encodings aren't one of them.

On 11/1/07, James G. sack (jim) <report@bugs.python.org> wrote:
>
> James G. sack (jim) added the comment:
>
> Adam Olsen wrote:
> > Adam Olsen added the comment:
> >
> > The problem with "being tolerate" as you suggest is you lose the ability
> > to round-trip.  Read in a file using the UTF-8 signature, write it back
> > out, and suddenly nothing else can open it.
>
> I'm sorry, I don't see the round-trip problem you describe.
>
> If codec utf_8 or utf_8_sig were to accept input with or without the
> 3-byte BOM, and write it as currently specified without/with the BOM
> respectively, then _I_ can reread again with either utf_8 or utf_8_sig.
>
> No round trip problem _for me_.
>
> Now If I need to exchange with some else, that's a different matter. One
> way or another I need to know what format they need and create the
> output they require for their input.
>
> Am I missing something in your statement of a problem?

You don't seem to think it's important to interact with other
programs.  If you're importing with no intent to write out to a common
format, then yes, autodetecting the BOM is just fine.  Python needs a
more general default though, and not guessing is part of that.

> > Conceptually, these signatures shouldn't even be part of the encoding;
> > they're a prefix in the file indicating which encoding to use.
>
> Yes, I'm aware of that, but you can't predict what you may find in dusty
> archives, or what someone may give to you. IMO, that's the basis of
> being tolerant in what you accept, is it not?

Garbage in, garbage out.  There's a lot of protocols with whitespace,
capitalization, etc that you can fudge around while retaining the same
contents; character set encodings aren't one of them.

History
Date	User	Action	Args
2007-11-01 22:21:34	Rhamphoryncus	set	spambayes_score: 0.015951 -> 0.015951002 recipients: + Rhamphoryncus, gvanrossum, jgsack, ggenellina
2007-11-01 22:21:34	Rhamphoryncus	link	issue1328 messages
2007-11-01 22:21:33	Rhamphoryncus	create