classification
Title: Force BOM option in UTF output.
Type: enhancement Stage:
Components: Unicode Versions: Python 2.6, Python 2.5
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: doerwalter Nosy List: Rhamphoryncus, doerwalter, ggenellina, jafo, jgsack
Priority: normal Keywords:

Created on 2007-10-25 22:59 by jgsack, last changed 2008-03-22 14:44 by doerwalter. This issue is now closed.

Messages (19)
msg56759 - (view) Author: James G. sack (jim) (jgsack) Date: 2007-10-25 22:59
The behavior of codecs utf_16_[bl]e is to omit the BOM.

In a testing environment (and perhaps elsewhere), a forced BOM is useful.
I'm requesting an optional argument something like
 force_BOM=False

I guess it would require such an option in multiple function calls, sorry I 
don't know enough to itemize them.

If this is implemented, it might be desirable to think about the aliases 
like unicode*unmarked.

Regards,
..jim
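A minimal sketch of the behavior this report describes, assuming a current Python 3 interpreter (the thread concerns Python 2.5/2.6, where the same codecs operate on unicode objects). The variable names are illustrative only.

```python
import codecs

text = "hi"

# The endian-specific codecs emit the code units only, with no BOM.
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")

# The generic codec prepends a BOM in the machine's native byte order.
generic = text.encode("utf-16")

le_has_bom = le.startswith(codecs.BOM_UTF16_LE)
be_has_bom = be.startswith(codecs.BOM_UTF16_BE)
generic_has_bom = generic[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```

Here le_has_bom and be_has_bom are false while generic_has_bom is true, which is the asymmetry the request is about.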
msg56780 - (view) Author: James G. sack (jim) (jgsack) Date: 2007-10-26 06:33
Feature Request REVISION
========================
Upon reflection and more playing around with some test cases, I wish to 
revise my feature request.

I think the utf8 codecs should accept input with or without the "sig".
On output, only the utf_8_sig should write the 3-byte "sig". This behavior 
change would not seem disruptive to current applications. 

For utf16, (arguably) a missing BOM should merely assume machine endianness.
For utf_16_le, utf_16_be input, both should accept & discard a BOM.
On output, I'm not sure; maybe all should write a BOM unless passed a flag 
signifying no bom? 
Or to preserve backward compat, could have a parm write_bom defaulting to 
True for utf16 and False for utf_16_le and utf_16_be. This is a 
modification of the original request (for a force_bom flag).

Unless I have confused myself with my test cases, the current codecs are 
slightly inconsistent for the utf8 codecs:

utf8 treats "sig" as real data, if present, but..
utf_8_sig works right even without the "sig" (so this one I like as is!)

The 16'ers seem to match the (inferred) specs, but for completeness here:
utf_16 refuses to proceed w/o BOM (even with correct endian input data)
utf_16_le treats BOM as data
utf_16_be treats BOM as data

Regards,
..jim
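The decoding differences listed above can be checked with a short sketch like the following. It assumes Python 3 bytes/str semantics and only uses codec names that ship with the standard library; the variable names are illustrative.

```python
import codecs

with_sig = codecs.BOM_UTF8 + b"abc"

# utf-8 treats the 3-byte signature as data: it decodes to U+FEFF.
as_utf8 = with_sig.decode("utf-8")

# utf-8-sig accepts input with or without the signature.
sig_with = with_sig.decode("utf-8-sig")
sig_without = b"abc".decode("utf-8-sig")

# On output, only utf-8-sig writes the signature.
out_utf8 = "abc".encode("utf-8")
out_sig = "abc".encode("utf-8-sig")

# The endian-specific UTF-16 decoders also treat a BOM as data.
le_with_bom = codecs.BOM_UTF16_LE + "abc".encode("utf-16-le")
as_utf16_le = le_with_bom.decode("utf-16-le")
```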
msg56782 - (view) Author: James G. sack (jim) (jgsack) Date: 2007-10-26 06:43
Later note: kind of weird!

On my LE machine, utf16 reads my BE-formatted test data (no BOM) 
apparently assuming some kind of surrogate format, until it finds 
an "illegal UTF-16 surrogate".

That I fail to understand, especially since it quits upon seeing 
a BOM with valid LE data.

Test data and test code available on request.

Regards,
..jim
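One plausible reading of this observation, sketched under the assumption of a current Python 3 interpreter: without a BOM the utf-16 decoder falls back to the machine's native byte order, so opposite-endian data decodes to unrelated code units, and the error only appears once one of those units happens to fall in the surrogate range. The test data below is made up for illustration and is not the reporter's data.

```python
import sys

# Pick the byte order opposite to this machine's native one.
swapped = "utf-16-be" if sys.byteorder == "little" else "utf-16-le"

# "A" is U+0041; read with the wrong byte order it becomes U+4100,
# a valid code point, so no error is raised.
misread = "A".encode(swapped).decode("utf-16")

# U+00D8 read with the wrong byte order becomes 0xD800, a lone high
# surrogate, which the decoder rejects.
try:
    "\u00d8a".encode(swapped).decode("utf-16")
    surrogate_error = False
except UnicodeDecodeError:
    surrogate_error = True
```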
msg56801 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-10-26 17:44
Can't you force a BOM by simply writing \ufeff at the start of the file?
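A small sketch of this suggestion, assuming Python 3 and using U+FEFF (the BOM code point): prepending the character to the text makes an endian-specific codec emit the matching BOM bytes. The variable names are illustrative.

```python
import codecs

text = "hello"

# Encoding U+FEFF with utf-16-le yields exactly the little-endian BOM.
forced = ("\ufeff" + text).encode("utf-16-le")

starts_with_bom = forced.startswith(codecs.BOM_UTF16_LE)

# The generic utf-16 decoder consumes the BOM and returns the text alone.
decoded = forced.decode("utf-16")
```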
msg56813 - (view) Author: James G. sack (jim) (jgsack) Date: 2007-10-26 19:34
re: msg56801

Yes, of course I can explicitly write the BOM. I did realize that after 
my first post ( my-'duh' :-[ ).

But after playing some more, I do think this issue has become a 
worthwhile one. My second post msg56780 asks that utf_8 be tolerant 
of the 3-byte sig BOM, and utf_16_[bl]e be tolerant of their BOMs, 
which I argue is consistent with "be liberal on what you accept".

A second half of that message suggests that it might be worth 
considering something like a write_bom parameter with utf_16 
defaulting to True, and utf_16_[bl]e defaulting to False.

My third post (msg56782) may actually represent a bug. I have a 
unittest for this and would be glad to provide it (although I need 
to reduce a larger test to a simple case). I will look at this 
again, and re-pester you as required.

Regards (and thanks for the reply),
..jim
msg56814 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-10-26 19:36
If you can, please submit a patch that fixes all those issues, with
unit tests and doc changes if at all possible. That will make it much
easier to evaluate the ramifications of your proposal(s).
msg56817 - (view) Author: James G. sack (jim) (jgsack) Date: 2007-10-26 19:54
OK, I will work on it. I have just downloaded trunk and will see what 
I can do. Might be a week or two.

..jim
msg57028 - (view) Author: Adam Olsen (Rhamphoryncus) Date: 2007-11-01 19:07
The problem with "being tolerant" as you suggest is you lose the ability
to round-trip.  Read in a file using the UTF-8 signature, write it back
out, and suddenly nothing else can open it.

Conceptually, these signatures shouldn't even be part of the encoding;
they're a prefix in the file indicating which encoding to use.

Note that the BOM signature (ZWNBSP) is a valid code point.  Although it
seems unlikely for a file to start with ZWNBSP, if you were to chop a file
up into smaller chunks and decode them individually you'd be more likely
to run into it.  (However, it seems general use of ZWNBSP is being
discouraged precisely due to this potential for confusion[1]).

In summary, guessing the encoding should never be the default.  Although
it may be appropriate in some contexts, we must ensure we emit the right
encoding for those contexts as well. [2]

[1] http://unicode.org/faq/utf_bom.html#38
[2] http://unicode.org/faq/utf_bom.html#28
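Both concerns can be illustrated with a short sketch, assuming Python 3 and made-up sample data: a tolerant read followed by a plain UTF-8 write does not reproduce the original bytes, and a chunk that legitimately begins with ZWNBSP loses that character when each chunk is decoded on its own with a signature-stripping decoder.

```python
import codecs

# Round trip: read tolerantly, write plain UTF-8.
original = codecs.BOM_UTF8 + b"data"
rewritten = original.decode("utf-8-sig").encode("utf-8")
round_trips = rewritten == original

# Chunking: here U+FEFF is real data at the start of a later chunk.
chunk = "\ufeffrest".encode("utf-8")
tolerant = chunk.decode("utf-8-sig")
strict = chunk.decode("utf-8")
```

In this sketch round_trips is false, and tolerant has silently dropped the leading character that strict preserves.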
msg57033 - (view) Author: James G. sack (jim) (jgsack) Date: 2007-11-01 19:56
Adam Olsen wrote:
> Adam Olsen added the comment:
> 
> The problem with "being tolerant" as you suggest is you lose the ability
> to round-trip.  Read in a file using the UTF-8 signature, write it back
> out, and suddenly nothing else can open it.

I'm sorry, I don't see the round-trip problem you describe.

If codec utf_8 or utf_8_sig were to accept input with or without the
3-byte BOM, and write it as currently specified without/with the BOM
respectively, then _I_ can reread again with either utf_8 or utf_8_sig.

No round trip problem _for me_.

Now if I need to exchange with someone else, that's a different matter. One
way or another I need to know what format they need and create the
output they require for their input.

Am I missing something in your statement of a problem?

> Conceptually, these signatures shouldn't even be part of the encoding;
> they're a prefix in the file indicating which encoding to use.

Yes, I'm aware of that, but you can't predict what you may find in dusty
archives, or what someone may give to you. IMO, that's the basis of
being tolerant in what you accept, is it not?

> Note that the BOM signature (ZWNBSP) is a valid code point.  Although it
> seems unlikely for a file to start with ZWNBSP, if you were to chop a file
> up into smaller chunks and decode them individually you'd be more likely
> to run into it.  (However, it seems general use of ZWNBSP is being
> discouraged precisely due to this potential for confusion[1]).

I understand that throwing away a ZWNBSP at the beginning of a file does
risk discarding data rather than metadata. I also believe the standards
people recognized that and deliberately picked a BOM character that is a
calculated low risk. I'm willing to accept that risk.

> In summary, guessing the encoding should never be the default.  Although
> it may be appropriate in some contexts, we must ensure we emit the right
> encoding for those contexts as well. [2]
> 
> [1] http://unicode.org/faq/utf_bom.html#38
> [2] http://unicode.org/faq/utf_bom.html#28

From my point of view, I don't see that being tolerant in what _I_ (or
my applications) accept violates any guidelines.

Please explain where I am wrong.

Regards,
..jim
msg57041 - (view) Author: Adam Olsen (Rhamphoryncus) Date: 2007-11-01 22:21
On 11/1/07, James G. sack (jim) <report@bugs.python.org> wrote:
>
> James G. sack (jim) added the comment:
>
> Adam Olsen wrote:
> > Adam Olsen added the comment:
> >
> > The problem with "being tolerant" as you suggest is you lose the ability
> > to round-trip.  Read in a file using the UTF-8 signature, write it back
> > out, and suddenly nothing else can open it.
>
> I'm sorry, I don't see the round-trip problem you describe.
>
> If codec utf_8 or utf_8_sig were to accept input with or without the
> 3-byte BOM, and write it as currently specified without/with the BOM
> respectively, then _I_ can reread again with either utf_8 or utf_8_sig.
>
> No round trip problem _for me_.
>
> Now if I need to exchange with someone else, that's a different matter. One
> way or another I need to know what format they need and create the
> output they require for their input.
>
> Am I missing something in your statement of a problem?

You don't seem to think it's important to interact with other
programs.  If you're importing with no intent to write out to a common
format, then yes, autodetecting the BOM is just fine.  Python needs a
more general default though, and not guessing is part of that.

> > Conceptually, these signatures shouldn't even be part of the encoding;
> > they're a prefix in the file indicating which encoding to use.
>
> Yes, I'm aware of that, but you can't predict what you may find in dusty
> archives, or what someone may give to you. IMO, that's the basis of
> being tolerant in what you accept, is it not?

Garbage in, garbage out.  There's a lot of protocols with whitespace,
capitalization, etc that you can fudge around while retaining the same
contents; character set encodings aren't one of them.
msg57522 - (view) Author: James G. sack (jim) (jgsack) Date: 2007-11-15 08:40
re: msg57041, I'm sorry if I gave the wrong impression about interacting 
with other programs. I started this feature request with some half-baked 
thinking, which I tried to revise in my second post.

Anyway I'm most interested right now in lobbying for a change to utf_8 to 
accept input with an _optional_ BOM-signature so that the input part would 
behave just like utf_8_sig, where the BOM-sig is already optional (on 
input).

In the process of trying to come up with a test and patch for this, I 
discovered a bug in utf_8_sig (issue #1444,
http://bugs.python.org/issue1444).

After there is some action on that I will return here to continue with 
utf_8, which I have convinced myself (anyways) is a reasonable and safe 
revision.

..jim
msg57527 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2007-11-15 12:57
jgsack wrote:
>
> If codec utf_8 or utf_8_sig were to accept input with or without the
> 3-byte BOM, and write it as currently specified without/with the BOM
> respectively, then _I_ can reread again with either utf_8 or utf_8_sig.

That's exactly what the utf_8_sig codec does. The decoder accepts input
with or without the BOM (the (first) BOM doesn't get returned). The
encoder always prepends a BOM.

Or do you want a codec that behaves like utf_8 on reading and like
utf_8_sig on writing? Such a codec indeed wouldn't roundtrip.
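The utf_8_sig behavior described here can be sketched as follows, assuming Python 3. It shows that only the first BOM is dropped, that the encoder always prepends one, and that the incremental decoder copes with a BOM split across calls. The variable names are illustrative.

```python
import codecs

# Only the first BOM is stripped; a second one is returned as U+FEFF.
double = (codecs.BOM_UTF8 * 2 + b"x").decode("utf-8-sig")

# The encoder always prepends the signature.
encoded = "x".encode("utf-8-sig")

# Incremental decoding with the signature split across two calls.
decoder = codecs.getincrementaldecoder("utf-8-sig")()
parts = [
    decoder.decode(b"\xef\xbb"),         # incomplete BOM: nothing returned yet
    decoder.decode(b"\xbfab"),           # BOM completed and skipped
    decoder.decode(b"", final=True),     # flush
]
incremental = "".join(parts)
```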
msg57529 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2007-11-15 13:41
> For utf16, (arguably) a missing BOM should merely assume machine endianness.
> For utf_16_le, utf_16_be input, both should accept & discard a BOM.
> On output, I'm not sure; maybe all should write a BOM unless passed a flag
> signifying no bom?
> Or to preserve backward compat, could have a parm write_bom defaulting to
> True for utf16 and False for utf_16_le and utf_16_be. This is a 
> modification of the original request (for a force_bom flag).

The Unicode FAQ (http://unicode.org/faq/utf_bom.html#28) clearly states:

"""
Q: How I should deal with BOMs?
[...]
Where the precise type of the data stream is known (e.g. Unicode
big-endian or Unicode little-endian), the BOM should not be used. In
particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE,
UTF-32BE or UTF-32LE a BOM *must* not be used. [...]
"""
msg57691 - (view) Author: James G. sack (jim) (jgsack) Date: 2007-11-20 03:39
More discussion of utf_8.py decoding behavior (and possible change):

For my needs, I would like the decoding parts of the utf_8 module to treat 
an initial BOM as an optional signature and skip it if there is one (just 
like the utf_8_sig decoder). In fact I have a working patch that replaces 
the utf_8 decode, IncrementalDecoder and StreamReader components by 
direct transplants from utf_8_sig (as recently repaired -- there was a 
StreamReader error).

However the reason for discussion is to ask how it might impact existing 
code.

I can imagine there might be utf_8 client code out there which expects to 
see a leading U+feff as (perhaps) a clue that the output should be returned 
with a BOM-signature (say) to accommodate the guessed input requirements of 
the remote correspondent.

Making my work easier might actually make someone else's work (probably, 
annoyingly) harder. 

So what to do?

I can just live with code like
  if input.startswith(u"\ufeff"):
    input = input[1:]
spread around, and of course slightly different for incremental and stream 
inputs. 
  
  But I probably wouldn't. I would probably substitute a
  "my_utf_8" encoding to make my code a little cleaner.

Another thought I had would require "the other guy" to update his code, but 
at least it wouldn't make his work annoyingly difficult like my original 
change might have.

Here's the basic outline:

- Add another decoder function that returns a 3-tuple
  decode3(input, errors='strict') => (data, consumed, had_bom)
where had_bom is true if a leading bom was seen and skipped

- then the usual decode is just something like
  def decode(input, errors='strict'):
    return decode3(input, errors)[:2]

- add member variable and accessor to both IncrementalDecoder and 
StreamReader classes something like
  def had_bom(self):
    return self._had_bom
and initialize/set the self._had_bom variable as required.
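The stateless part of this outline might look roughly like the sketch below. decode3 and had_bom are the names proposed in this message and are not part of the codecs module; codecs.utf_8_decode is the existing low-level decoder. The sketch assumes Python 3, where the input is a bytes object.

```python
import codecs

def decode3(data, errors="strict"):
    """Hypothetical helper: decode UTF-8 and report whether a BOM was skipped."""
    had_bom = data[:3] == codecs.BOM_UTF8
    if had_bom:
        text, consumed = codecs.utf_8_decode(data[3:], errors, True)
        # Count the three skipped signature bytes as consumed input.
        return text, consumed + 3, True
    text, consumed = codecs.utf_8_decode(data, errors, True)
    return text, consumed, False

def decode(data, errors="strict"):
    """The usual two-tuple interface, built on top of decode3."""
    return decode3(data, errors)[:2]
```

A caller that cares about the signature would use decode3 directly, while existing callers of the two-tuple decode would be unaffected by the extra flag.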

This complicates the interface somewhat and requires some additional 
documentation.

   To document my original simple [-minded] idea required 
   possibly only a few more words in the existing paragraph
   on utf_8_sig, to mention that both mods had the same 
   decoding behavior but different encoding.

I thought of a secondary consideration: If utf_8 and utf_8_sig are "almost 
the same", it's possible that future refactoring might unify them with 
differences contained in behavior flags (e.g., skip_leading_bom). The leading 
bom processing might even be pushed into codecs.utf_8_decode for possible 
minor advantages. 

Is there anybody monitoring this who has an opinion on this? 

..jim
msg63705 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2008-03-17 18:18
It sounds like the Unicode FAQ has an authoritative statement on this,
is this a "wontfix", or does this need more discussion?  Perhaps on
python-dev or at the sprints this week?
msg64189 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2008-03-20 18:16
I don't see exactly what James is proposing.

> For my needs, I would like the decoding parts of the utf_8 module
> to treat an initial BOM as an optional signature and skip it if
> there is one (just like the utf_8_sig decoder). In fact I have
> a working patch that replaces the utf_8 decode,
> IncrementalDecoder and StreamReader components by direct
> transplants from utf_8_sig (as recently repaired -- there was a
> StreamReader error).

If you want a decoder that behaves like the utf-8-sig decoder, use the
utf-8-sig decoder. I don't see how changing the utf-8 decoder helps here.

> I can imagine there might be utf_8 client code out there which
> expects to see a leading U+feff as (perhaps) a clue that the
> output should be returned with a BOM-signature (say) to
> accommodate the guessed input requirements of the remote
> correspondent.

In this case use UTF-8: The leading BOM will be passed to the application.

> I can just live with code like
>  if input.startswith(u"\ufeff"):
>    input = input[1:]
> spread around, and of course slightly different for incremental
> and stream inputs.

Can you post an example that requires this code?
msg64217 - (view) Author: James G. sack (jim) (jgsack) Date: 2008-03-20 22:21
> Can you post an example that requires this code?

This is not a big issue, and it wouldn't hurt if it got declared "go away 
and come back later if you have patch, test, docs, and a convincing use 
case". 

..But, for the record..

Suppose I want to both read and write some utf8. It is unknown whether the 
input has a BOM, but it is known to be utf8. I want to write utf8 without 
any BOM. I see two options, which I find slightly ugly/annoying/error-prone:

a) Use 2 separate encodings: read via utf_8_sig so as to transparently 
accept input with/without BOM; use utf_8 on output to not emit any BOM. 

b) Use utf_8 for read and write and explicitly check for and discard 
leading BOM on input if any.
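For concreteness, the two options could be sketched as below, assuming Python 3's built-in open() with an encoding argument. The helper names and the temporary file are invented for this illustration.

```python
import os
import tempfile

def read_option_a(path):
    # Option (a): utf-8-sig transparently skips a leading BOM, if any.
    with open(path, encoding="utf-8-sig") as f:
        return f.read()

def read_option_b(path):
    # Option (b): plain utf-8, then check for and discard the BOM by hand.
    with open(path, encoding="utf-8") as f:
        text = f.read()
    return text[1:] if text.startswith("\ufeff") else text

def write_without_bom(path, text):
    # Plain utf-8 never emits a BOM on output.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

# Demonstration with a throwaway input file that starts with a BOM.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "in.txt")
dst = os.path.join(tmpdir, "out.txt")
with open(src, "wb") as f:
    f.write(b"\xef\xbb\xbfhello")

text_a = read_option_a(src)
text_b = read_option_b(src)
write_without_bom(dst, text_a)
with open(dst, "rb") as f:
    written = f.read()
```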

What _I_ would prefer is that utf_8 would ignore a BOM, if present (just 
like utf_8_sig). 

(What I was talking about in my last post was a complication in 
consideration of someone else who would prefer otherwise, or of code that 
might break upon my change.)

Regards,
..jim
msg64324 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2008-03-22 14:42
If you want to use UTF-8-sig for decoding and UTF-8 for encoding and
have this available as one codec you can define your own codec for this:

import codecs

def search_function(name):
    if name == "myutf8":
        utf8 = codecs.lookup("utf-8")
        utf8_sig = codecs.lookup("utf-8-sig")
        return codecs.CodecInfo(
            name='myutf8',
            encode=utf8.encode,
            decode=utf8_sig.decode,
            incrementalencoder=utf8.IncrementalEncoder,
            incrementaldecoder=utf8_sig.IncrementalDecoder,
            streamreader=utf8_sig.StreamReader,
            streamwriter=utf8.StreamWriter,
        )


codecs.register(search_function)

Closing the issue as "wont fix"
msg64325 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2008-03-22 14:44
Oops, that code was supposed to read:

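import codecs

def search_function(name):
    # Return a CodecInfo for "myutf8" only; falling through returns None,
    # which tells the codec registry to keep searching.
    if name == "myutf8":
        utf8 = codecs.lookup("utf-8")
        utf8_sig = codecs.lookup("utf-8-sig")
        # Encoding parts come from utf-8 (no BOM written); decoding parts
        # come from utf-8-sig (a leading BOM is skipped if present).
        return codecs.CodecInfo(
            name='myutf8',
            encode=utf8.encode,
            decode=utf8_sig.decode,
            incrementalencoder=utf8.incrementalencoder,
            incrementaldecoder=utf8_sig.incrementaldecoder,
            streamreader=utf8_sig.streamreader,
            streamwriter=utf8.streamwriter,
        )


codecs.register(search_function)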
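A usage sketch for the combined codec, assuming Python 3. The search function is repeated so the example runs on its own; "myutf8" is the name chosen in the message above, and the sample strings are made up. Note that codecs.register affects the whole process.

```python
import codecs

def search_function(name):
    # Same approach as above: utf-8 for encoding, utf-8-sig for decoding.
    if name == "myutf8":
        utf8 = codecs.lookup("utf-8")
        utf8_sig = codecs.lookup("utf-8-sig")
        return codecs.CodecInfo(
            name="myutf8",
            encode=utf8.encode,
            decode=utf8_sig.decode,
            incrementalencoder=utf8.incrementalencoder,
            incrementaldecoder=utf8_sig.incrementaldecoder,
            streamreader=utf8_sig.streamreader,
            streamwriter=utf8.streamwriter,
        )
    return None

codecs.register(search_function)

# Decoding skips a leading BOM if present and accepts input without one.
with_bom = (codecs.BOM_UTF8 + b"abc").decode("myutf8")
without_bom = b"abc".decode("myutf8")

# Encoding never writes a BOM.
encoded = "abc".encode("myutf8")
```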
History
Date                 User           Action  Args
2008-03-22 14:44:52  doerwalter     set     messages: + msg64325
2008-03-22 14:42:01  doerwalter     set     status: open -> closed
                                            resolution: wont fix
                                            messages: + msg64324
2008-03-20 22:21:41  jgsack         set     messages: + msg64217
2008-03-20 18:16:12  doerwalter     set     messages: + msg64189
2008-03-17 18:18:56  jafo           set     title: feature request: force BOM option -> Force BOM option in UTF output.
                                            nosy: + jafo
                                            messages: + msg63705
                                            priority: normal
                                            assignee: doerwalter
                                            type: behavior -> enhancement
2007-11-20 22:23:45  gvanrossum     set     nosy: - gvanrossum
2007-11-20 03:39:57  jgsack         set     messages: + msg57691
                                            versions: + Python 2.6
2007-11-15 13:41:57  doerwalter     set     messages: + msg57529
2007-11-15 12:57:09  doerwalter     set     nosy: + doerwalter
                                            messages: + msg57527
2007-11-15 08:40:50  jgsack         set     messages: + msg57522
2007-11-01 22:21:34  Rhamphoryncus  set     messages: + msg57041
2007-11-01 19:56:30  jgsack         set     messages: + msg57033
2007-11-01 19:07:38  Rhamphoryncus  set     nosy: + Rhamphoryncus
                                            messages: + msg57028
2007-10-28 01:40:09  ggenellina     set     nosy: + ggenellina
2007-10-26 19:54:12  jgsack         set     messages: + msg56817
2007-10-26 19:36:45  gvanrossum     set     messages: + msg56814
2007-10-26 19:34:47  jgsack         set     messages: + msg56813
2007-10-26 17:44:31  gvanrossum     set     nosy: + gvanrossum
                                            messages: + msg56801
2007-10-26 06:43:59  jgsack         set     messages: + msg56782
2007-10-26 06:33:33  jgsack         set     messages: + msg56780
2007-10-25 22:59:36  jgsack         create