Message57041
On 11/1/07, James G. sack (jim) <report@bugs.python.org> wrote:
>
> James G. sack (jim) added the comment:
>
> Adam Olsen wrote:
> > Adam Olsen added the comment:
> >
> > The problem with "being tolerant" as you suggest is you lose the ability
> > to round-trip. Read in a file using the UTF-8 signature, write it back
> > out, and suddenly nothing else can open it.
>
> I'm sorry, I don't see the round-trip problem you describe.
>
> If codec utf_8 or utf_8_sig were to accept input with or without the
> 3-byte BOM, and write it as currently specified without/with the BOM
> respectively, then _I_ can reread again with either utf_8 or utf_8_sig.
>
> No round trip problem _for me_.
>
> Now if I need to exchange with someone else, that's a different matter. One
> way or another I need to know what format they need and create the
> output they require for their input.
>
> Am I missing something in your statement of a problem?
You don't seem to think it's important to interact with other
programs. If you're importing with no intent to write out to a common
format, then yes, autodetecting the BOM is just fine. Python needs a
more general default though, and not guessing is part of that.
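
To make the round-trip concern concrete, here is a minimal sketch of how
the two codecs behave in current Python 3 (the names below are purely
illustrative, and the codecs being discussed in this issue may have
differed in detail at the time):

```python
import codecs

text = "hello"

# utf-8-sig writes the 3-byte signature, plain utf-8 does not.
with_bom = text.encode("utf-8-sig")
without_bom = text.encode("utf-8")
assert with_bom == codecs.BOM_UTF8 + without_bom

# utf-8-sig already accepts input with or without the signature.
assert with_bom.decode("utf-8-sig") == text
assert without_bom.decode("utf-8-sig") == text

# Plain utf-8 does not guess: the signature survives as U+FEFF.
kept_bom = with_bom.decode("utf-8")
assert kept_bom == "\ufeff" + text

# A tolerant read followed by a plain utf-8 write drops the signature,
# so the bytes written are not the bytes that were read.
round_tripped = with_bom.decode("utf-8-sig").encode("utf-8")
assert round_tripped != with_bom
```

A program that relies on the signature to identify the file would no
longer find it after that last step.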
> > Conceptually, these signatures shouldn't even be part of the encoding;
> > they're a prefix in the file indicating which encoding to use.
>
> Yes, I'm aware of that, but you can't predict what you may find in dusty
> archives, or what someone may give to you. IMO, that's the basis of
> being tolerant in what you accept, is it not?
Garbage in, garbage out. There are a lot of protocols with whitespace,
capitalization, etc. that you can fudge around while retaining the same
contents; character set encodings aren't one of them.
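
As a rough illustration of the difference (a sketch only; the header name
and sample string are made up for the example): normalizing case keeps
the meaning of a case-insensitive field, while decoding bytes with the
wrong character set changes the text itself.

```python
# Case can be normalized without changing what a case-insensitive
# header name refers to.
assert "Content-Type".lower() == "CONTENT-TYPE".lower()

# Decoding with the wrong character set silently produces different text.
original = "caf\u00e9"              # "café"
data = original.encode("utf-8")     # b"caf\xc3\xa9"
wrong = data.decode("latin-1")      # every byte maps to some character
assert wrong != original
assert len(wrong) == len(original) + 1
```

No error is raised in the second case, which is why guessing is risky.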

Date | User | Action | Args
2007-11-01 22:21:34 | Rhamphoryncus | set | spambayes_score: 0.015951 -> 0.015951002; recipients: + Rhamphoryncus, gvanrossum, jgsack, ggenellina
2007-11-01 22:21:34 | Rhamphoryncus | link | issue1328 messages
2007-11-01 22:21:33 | Rhamphoryncus | create |