Author Rhamphoryncus
Recipients Rhamphoryncus, ggenellina, gvanrossum, jgsack
Date 2007-11-01.22:21:33
Message-id <aac2c7cb0711011521q617740b0sf0596ab435cbf615@mail.gmail.com>
In-reply-to <472A2F3C.7020205@san.rr.com>
Content
On 11/1/07, James G. sack (jim) <report@bugs.python.org> wrote:
>
> James G. sack (jim) added the comment:
>
> Adam Olsen wrote:
> > Adam Olsen added the comment:
> >
> > The problem with "being tolerant" as you suggest is you lose the ability
> > to round-trip.  Read in a file using the UTF-8 signature, write it back
> > out, and suddenly nothing else can open it.
>
> I'm sorry, I don't see the round-trip problem you describe.
>
> If codec utf_8 or utf_8_sig were to accept input with or without the
> 3-byte BOM, and write it as currently specified without/with the BOM
> respectively, then _I_ can reread again with either utf_8 or utf_8_sig.
>
> No round trip problem _for me_.
>
> Now if I need to exchange with someone else, that's a different matter. One
> way or another I need to know what format they need and create the
> output they require for their input.
>
> Am I missing something in your statement of a problem?

You don't seem to think it's important to interact with other
programs.  If you're importing with no intent to write out to a common
format, then yes, autodetecting the BOM is just fine.  Python needs a
more general default though, and not guessing is part of that.
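
The round-trip concern can be sketched roughly as follows. This is a minimal illustration, assuming Python 3 and its standard "utf-8" and "utf-8-sig" codecs; the variable names are made up for the example and do not come from the thread.

```python
import codecs

# Illustrative data: a file whose bytes start with the UTF-8 signature (BOM).
text = "hello"
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")

# Reading with utf-8-sig strips the signature...
decoded = with_bom.decode("utf-8-sig")

# ...so writing back with plain utf-8 silently drops it.
rewritten = decoded.encode("utf-8")

# Plain utf-8 does not guess: the signature survives as U+FEFF.
raw = with_bom.decode("utf-8")

# Using the same codec in both directions reproduces the original bytes.
same_plain = raw.encode("utf-8")
same_sig = decoded.encode("utf-8-sig")
```

Here `rewritten` no longer matches `with_bom`, while `same_plain` and `same_sig` do. A program that expects the signature (or one that chokes on it) sees a different file depending on which codec did the writing, which is the interoperability problem described above.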

> > Conceptually, these signatures shouldn't even be part of the encoding;
> > they're a prefix in the file indicating which encoding to use.
>
> Yes, I'm aware of that, but you can't predict what you may find in dusty
> archives, or what someone may give to you. IMO, that's the basis of
> being tolerant in what you accept, is it not?

Garbage in, garbage out.  There are a lot of protocols where you can
fudge whitespace, capitalization, etc. while retaining the same
contents; character set encodings aren't one of them.
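
As a small illustration of why the encoding cannot be fudged, the sketch below decodes the same bytes under two different assumptions. It assumes Python 3; the sample string and variable names are hypothetical.

```python
import codecs

# The same bytes, interpreted under two different encoding assumptions.
payload = "é".encode("utf-8")

as_utf8 = payload.decode("utf-8")      # the intended character
as_latin1 = payload.decode("latin-1")  # a wrong guess: two unrelated characters

# The UTF-8 signature itself turns into visible junk under a wrong guess.
bom_as_latin1 = codecs.BOM_UTF8.decode("latin-1")
```

Unlike stray whitespace or capitalization, a wrong guess changes the contents: `as_latin1` is a different string from `as_utf8`, and neither decode raises an error to flag the mismatch.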
History
Date                 User           Action  Args
2007-11-01 22:21:34  Rhamphoryncus  set     spambayes_score: 0.015951 -> 0.015951002; recipients: + Rhamphoryncus, gvanrossum, jgsack, ggenellina
2007-11-01 22:21:34  Rhamphoryncus  link    issue1328 messages
2007-11-01 22:21:33  Rhamphoryncus  create