Issue 1328: Force BOM option in UTF output.

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/45669

classification

Title:	Force BOM option in UTF output.
Type:	enhancement	Stage:
Components:	Unicode	Versions:	Python 2.6, Python 2.5

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:	doerwalter	Nosy List:	Rhamphoryncus, doerwalter, ggenellina, jafo, jgsack
Priority:	normal	Keywords:

Created on 2007-10-25 22:59 by jgsack, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (19)
msg56759 - (view)	Author: James G. sack (jim) (jgsack)	Date: 2007-10-25 22:59
The behavior of the utf_16_[bl]e codecs is to omit the BOM. In a testing environment (and perhaps elsewhere), a forced BOM is useful. I'm requesting an optional argument, something like force_BOM=False. I guess it would require such an option in multiple function calls; sorry, I don't know enough to itemize them. If this is implemented, it might be desirable to think about the aliases like unicode*unmarked. Regards, ..jim
msg56780 - (view)	Author: James G. sack (jim) (jgsack)	Date: 2007-10-26 06:33
Feature Request REVISION
========================
Upon reflection and more playing around with some test cases, I wish to revise my feature request.

I think the utf8 codecs should accept input with or without the "sig". On output, only utf_8_sig should write the 3-byte "sig". This behavior change would not seem disruptive to current applications.

For utf16, (arguably) a missing BOM should merely assume machine endianness. For utf_16_le and utf_16_be input, both should accept & discard a BOM. On output, I'm not sure; maybe all should write a BOM unless passed a flag signifying no BOM? Or, to preserve backward compat, there could be a parm write_bom defaulting to True for utf16 and False for utf_16_le and utf_16_be. This is a modification of the original request (for a force_bom flag).

Unless I have confused myself with my test cases, the current codecs are slightly inconsistent for the utf8 codecs:
  utf8 treats "sig" as real data, if present, but..
  utf_8_sig works right even without the "sig" (so this one I like as is!)

The 16'ers seem to match the (inferred) specs, but for completeness here:
  utf_16 refuses to proceed w/o BOM (even with correct endian input data)
  utf_16_le treats BOM as data
  utf_16_be treats BOM as data

Regards, ..jim
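The decoding behavior listed in this message can be checked with a short script. This is only a minimal sketch, assuming Python 3 (the thread itself concerns Python 2.5/2.6); the sample text and variable names are illustrative and do not come from the original test cases. It covers only BOM-marked input, not the utf_16-without-BOM case.

```python
import codecs

text = "hi"
utf8_with_sig = codecs.BOM_UTF8 + text.encode("utf-8")
utf16le_with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# utf-8 passes the signature through as a leading U+FEFF character.
utf8_result = utf8_with_sig.decode("utf-8")

# utf-8-sig strips the signature and also accepts input without one.
sig_result = utf8_with_sig.decode("utf-8-sig")
sig_result_plain = text.encode("utf-8").decode("utf-8-sig")

# utf-16-le treats a leading BOM as data; plain utf-16 consumes it.
le_result = utf16le_with_bom.decode("utf-16-le")
utf16_result = utf16le_with_bom.decode("utf-16")

for label, value in [
    ("utf-8", utf8_result),
    ("utf-8-sig", sig_result),
    ("utf-8-sig, no sig", sig_result_plain),
    ("utf-16-le", le_result),
    ("utf-16", utf16_result),
]:
    print(label, repr(value))
```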
msg56782 - (view)	Author: James G. sack (jim) (jgsack)	Date: 2007-10-26 06:43
Later note: kind of weird! On my LE machine, utf16 reads my BE-formatted test data (no BOM), apparently assuming some kind of surrogate format, until it finds an "illegal UTF-16 surrogate". That I fail to understand, especially since it quits upon seeing a BOM with valid LE data. Test data and test code available on request. Regards, ..jim
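One plausible explanation for the surrogate error, offered here as an assumption rather than something confirmed in the thread: without a BOM, the utf_16 decoder falls back to the machine's native byte order, so on a little-endian machine big-endian data is read with its bytes swapped. The sketch below (Python 3, illustrative data only) uses utf-16-le explicitly so that it behaves the same on any machine.

```python
# Big-endian UTF-16 bytes with no BOM.
be_ascii = "AB".encode("utf-16-be")       # bytes 00 41 00 42
be_latin = "\u00d8".encode("utf-16-be")   # bytes 00 D8

# Read with the wrong byte order, ASCII text silently becomes other characters.
swapped = be_ascii.decode("utf-16-le")

# U+00D8 read byte-swapped is the code unit 0xD800, a lone high surrogate,
# which the decoder rejects.
try:
    be_latin.decode("utf-16-le")
    surrogate_error = None
except UnicodeDecodeError as exc:
    surrogate_error = exc

print(repr(swapped))
print(surrogate_error)
```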
msg56801 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2007-10-26 17:44
Can't you force a BOM by simply writing \ufeff at the start of the file?
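Writing the BOM by hand can be sketched as follows. The BOM code point is U+FEFF, and encoding it with an endian-specific codec yields the matching byte sequence. This assumes Python 3; `write_with_bom` is a hypothetical helper, not an existing API.

```python
import codecs
import io


def write_with_bom(stream, text, encoding="utf-16-le"):
    """Write text to a binary stream, preceded by U+FEFF in the same encoding."""
    stream.write(("\ufeff" + text).encode(encoding))


buffer = io.BytesIO()
write_with_bom(buffer, "hi")
forced = buffer.getvalue()

# The output starts with the little-endian BOM, so a generic utf-16
# decoder can detect the byte order and strip the mark.
print(forced.startswith(codecs.BOM_UTF16_LE))
print(repr(forced.decode("utf-16")))
```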
msg56813 - (view)	Author: James G. sack (jim) (jgsack)	Date: 2007-10-26 19:34
re: msg56782 Yes, of course I can explicitly write the BOM. I did realize that after my first post ( my-'duh' :-[ ). But after playing some more, I do think this issue has become a worthwhile one. My second post msg56780 asks that utf_8 be tolerant of the 3-byte sig BOM, and utf_16_[bl]e be tolerant of their BOMs, which I argue is consistent with "be liberal in what you accept". A second half of that message suggests that it might be worth considering something like a write_bom parameter with utf_16 defaulting to True, and utf_16_[bl]e defaulting to False. My third post (msg56782) may actually represent a bug. I have a unittest for this and would be glad to provide it (although I need to reduce a larger test to a simple case). I will look at this again, and re-pester you as required. Regards (and thanks for the reply), ..jim
msg56814 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2007-10-26 19:36
If you can, please submit a patch that fixes all those issues, with unit tests and doc changes if at all possible. That will make it much easier to evaluate the ramifications of your proposal(s).
msg56817 - (view)	Author: James G. sack (jim) (jgsack)	Date: 2007-10-26 19:54
OK, I will work on it. I have just downloaded trunk and will see what I can do. Might be a week or two. ..jim
msg57028 - (view)	Author: Adam Olsen (Rhamphoryncus)	Date: 2007-11-01 19:07
The problem with "being tolerant" as you suggest is you lose the ability to round-trip. Read in a file using the UTF-8 signature, write it back out, and suddenly nothing else can open it.

Conceptually, these signatures shouldn't even be part of the encoding; they're a prefix in the file indicating which encoding to use.

Note that the BOM signature (ZWNBSP) is a valid code point. Although it seems unlikely for a file to start with ZWNBSP, if you were to chop a file up into smaller chunks and decode them individually you'd be more likely to run into it. (However, it seems general use of ZWNBSP is being discouraged precisely due to this potential for confusion[1]).

In summary, guessing the encoding should never be the default. Although it may be appropriate in some contexts, we must ensure we emit the right encoding for those contexts as well. [2]

[1] http://unicode.org/faq/utf_bom.html#38
[2] http://unicode.org/faq/utf_bom.html#28
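The chunking hazard described here can be sketched in a few lines. This assumes Python 3, and the sample text and chunk boundary are made up for illustration. Decoding each chunk separately with a signature-stripping codec drops a genuine U+FEFF that happens to start a chunk, whereas an incremental decoder only treats the very first bytes of the stream as a possible signature.

```python
import codecs

# Text containing a real ZWNBSP (U+FEFF) in the middle.
data = "ab\ufeffcd".encode("utf-8")
chunks = [data[:2], data[2:]]  # second chunk begins with the bytes EF BB BF

# Decoding each chunk on its own: the ZWNBSP is mistaken for a signature.
lossy = "".join(chunk.decode("utf-8-sig") for chunk in chunks)

# Plain utf-8 keeps the character.
faithful = "".join(chunk.decode("utf-8") for chunk in chunks)

# An incremental utf-8-sig decoder checks for a signature only once.
decoder = codecs.getincrementaldecoder("utf-8-sig")()
incremental = "".join(decoder.decode(chunk) for chunk in chunks)
incremental += decoder.decode(b"", final=True)

print(repr(lossy), repr(faithful), repr(incremental))
```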
msg57033 - (view)	Author: James G. sack (jim) (jgsack)	Date: 2007-11-01 19:56
Adam Olsen wrote:
> Adam Olsen added the comment:
>
> The problem with "being tolerant" as you suggest is you lose the ability
> to round-trip. Read in a file using the UTF-8 signature, write it back
> out, and suddenly nothing else can open it.

I'm sorry, I don't see the round-trip problem you describe.

If codec utf_8 or utf_8_sig were to accept input with or without the 3-byte BOM, and write it as currently specified without/with the BOM respectively, then _I_ can reread again with either utf_8 or utf_8_sig.

No round trip problem _for me_.

Now if I need to exchange with someone else, that's a different matter. One way or another I need to know what format they need and create the output they require for their input.

Am I missing something in your statement of a problem?

> Conceptually, these signatures shouldn't even be part of the encoding;
> they're a prefix in the file indicating which encoding to use.

Yes, I'm aware of that, but you can't predict what you may find in dusty archives, or what someone may give to you. IMO, that's the basis of being tolerant in what you accept, is it not?

> Note that the BOM signature (ZWNBSP) is a valid code point. Although it
> seems unlikely for a file to start with ZWNBSP, if you were to chop a file
> up into smaller chunks and decode them individually you'd be more likely
> to run into it. (However, it seems general use of ZWNBSP is being
> discouraged precisely due to this potential for confusion[1]).

I understand that throwing away a ZWNBSP at the beginning of a file does risk discarding data rather than metadata. I also believe the standards people recognized that and deliberately picked a BOM character that is a calculated low risk. I'm willing to accept that risk.

> In summary, guessing the encoding should never be the default. Although
> it may be appropriate in some contexts, we must ensure we emit the right
> encoding for those contexts as well. [2]
>
> [1] http://unicode.org/faq/utf_bom.html#38
> [2] http://unicode.org/faq/utf_bom.html#28

From my point of view, I don't see that being tolerant in what _I_ (or my applications) accept violates any guidelines. Please explain where I am wrong.

Regards, ..jim
msg57041 - (view)	Author: Adam Olsen (Rhamphoryncus)	Date: 2007-11-01 22:21
On 11/1/07, James G. sack (jim) <report@bugs.python.org> wrote:
>
> James G. sack (jim) added the comment:
>
> Adam Olsen wrote:
> > Adam Olsen added the comment:
> >
> > The problem with "being tolerant" as you suggest is you lose the ability
> > to round-trip. Read in a file using the UTF-8 signature, write it back
> > out, and suddenly nothing else can open it.
>
> I'm sorry, I don't see the round-trip problem you describe.
>
> If codec utf_8 or utf_8_sig were to accept input with or without the
> 3-byte BOM, and write it as currently specified without/with the BOM
> respectively, then _I_ can reread again with either utf_8 or utf_8_sig.
>
> No round trip problem _for me_.
>
> Now if I need to exchange with someone else, that's a different matter. One
> way or another I need to know what format they need and create the
> output they require for their input.
>
> Am I missing something in your statement of a problem?

You don't seem to think it's important to interact with other programs. If you're importing with no intent to write out to a common format, then yes, autodetecting the BOM is just fine. Python needs a more general default though, and not guessing is part of that.

> > Conceptually, these signatures shouldn't even be part of the encoding;
> > they're a prefix in the file indicating which encoding to use.
>
> Yes, I'm aware of that, but you can't predict what you may find in dusty
> archives, or what someone may give to you. IMO, that's the basis of
> being tolerant in what you accept, is it not?

Garbage in, garbage out. There are a lot of protocols with whitespace, capitalization, etc. that you can fudge around while retaining the same contents; character set encodings aren't one of them.
msg57522 - (view)	Author: James G. sack (jim) (jgsack)	Date: 2007-11-15 08:40
re: msg57041, I'm sorry if I gave the wrong impression about interacting with other programs. I started this feature request with some half-baked thinking, which I tried to revise in my second post. Anyway, I'm most interested right now in lobbying for a change to utf_8 to accept input with an _optional_ BOM-signature, so that the input part would behave just like utf_8_sig, where the BOM-sig is already optional (on input). In the process of trying to come up with a test and patch for this, I discovered a bug in utf_8_sig (issue #1444, http://bugs.python.org/issue1444). After there is some action on that I will return here to continue with utf_8, which I have convinced myself (anyways) is a reasonable and safe revision. ..jim
msg57527 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2007-11-15 12:57
jgsack wrote:
>
> If codec utf_8 or utf_8_sig were to accept input with or without the
> 3-byte BOM, and write it as currently specified without/with the BOM
> respectively, then _I_ can reread again with either utf_8 or utf_8_sig.

That's exactly what the utf_8_sig codec does. The decoder accepts input with or without the BOM (the (first) BOM doesn't get returned). The encoder always prepends a BOM.

Or do you want a codec that behaves like utf_8 on reading and like utf_8_sig on writing? Such a codec indeed wouldn't roundtrip.
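The round-trip behavior discussed here can be illustrated with a small sketch, assuming Python 3 and made-up sample bytes. The "hybrid" case is the hypothetical codec from the last paragraph (decode like utf_8, encode like utf_8_sig), emulated by calling the two existing codecs directly.

```python
import codecs

with_sig = codecs.BOM_UTF8 + b"hi"
without_sig = b"hi"

# utf-8-sig reproduces a signed file byte for byte...
sig_roundtrip = with_sig.decode("utf-8-sig").encode("utf-8-sig")

# ...but adds a signature to a file that did not have one.
sig_on_plain = without_sig.decode("utf-8-sig").encode("utf-8-sig")

# Decoding like utf-8 and encoding like utf-8-sig doubles the signature.
hybrid = with_sig.decode("utf-8").encode("utf-8-sig")

print(sig_roundtrip == with_sig)
print(sig_on_plain == without_sig)
print(hybrid)
```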
msg57529 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2007-11-15 13:41
> For utf16, (arguably) a missing BOM should merely assume machine endianness.
> For utf_16_le, utf_16_be input, both should accept & discard a BOM.
> On output, I'm not sure; maybe all should write a BOM unless passed a flag
> signifying no bom?
> Or to preserve backward compat, could have a parm write_bom defaulting to
> True for utf16 and False for utf_16_le and utf_16_be. This is a
> modification of the original request (for a force_bom flag).

The Unicode FAQ (http://unicode.org/faq/utf_bom.html#28) clearly states:

"""
Q: How I should deal with BOMs?
[...]
Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.
[...]
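Python's encoders appear to follow this rule already. The sketch below, assuming Python 3, checks which codecs emit a BOM on output; `starts_with_bom` is a hypothetical helper written for this illustration.

```python
import codecs

ALL_BOMS = (
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
)


def starts_with_bom(encoding):
    """Return True if encoding a short ASCII string yields a leading BOM."""
    return "hi".encode(encoding).startswith(ALL_BOMS)


# Endian-specific codecs write no BOM; the generic ones write one
# in the machine's native byte order.
results = {
    name: starts_with_bom(name)
    for name in ("utf-16", "utf-16-le", "utf-16-be",
                 "utf-32", "utf-32-le", "utf-32-be")
}
print(results)
```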
msg57691 - (view)	Author: James G. sack (jim) (jgsack)	Date: 2007-11-20 03:39
More discussion of utf_8.py decoding behavior (and possible change):

For my needs, I would like the decoding parts of the utf_8 module to treat an initial BOM as an optional signature and skip it if there is one (just like the utf_8_sig decoder). In fact I have a working patch that replaces the utf_8_sig decode, IncrementalDecoder and StreamReader components by direct transplants from utf_8_sig (as recently repaired -- there was a StreamReader error).

However, the reason for discussion is to ask how it might impact existing code. I can imagine there might be utf_8 client code out there which expects to see a leading U+feff as (perhaps) a clue that the output should be returned with a BOM-signature (say) to accommodate the guessed input requirements of the remote correspondent. Making my work easier might actually make someone else's work (probably, annoyingly) harder.

So what to do? I can just live with code like

    if input[0] == u"\ufeff":
        input = input[1:]

spread around, and of course slightly different for incremental and stream inputs. But I probably wouldn't. I would probably substitute a "my_utf_8" encoding to make my code a little cleaner.

Another thought I had would require "the other guy" to update his code, but at least it wouldn't make his work annoyingly difficult like my original change might have. Here's the basic outline:

- Add another decoder function that returns a 3-tuple
      decode3(input, errors='strict') => (data, consumed, had_bom)
  where had_bom is true if a leading bom was seen and skipped
- then the usual decode is just something like
      def decode(input, errors='strict'):
          return decode3(input, errors)[:2]
- add member variable and accessor to both IncrementalDecoder and StreamReader classes, something like
      def had_bom(self):
          return self.had_bom
  and initialize/set the self.had_bom variable as required.

This complicates the interface somewhat and requires some additional documentation. To document my original simple [-minded] idea required possibly only a few more words in the existing paragraph on utf_8_sig, to mention that both modules had the same decoding behavior but different encoding.

I thought of a secondary consideration: if utf_8 and utf_8_sig are "almost the same", it's possible that future refactoring might unify them, with differences contained in behavior-flags (e.g., skip_leading_bom). The leading bom processing might even be pushed into codecs.utf_8_decode for possible minor advantages.

Is there anybody monitoring this who has an opinion on this?

..jim
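The first two items of the outline above could look roughly like the following. This is a sketch only, assuming Python 3 and bytes input: `decode3` is the hypothetical function proposed in this message and was never added to the standard library, and the accessor on the incremental and stream classes is left out.

```python
import codecs


def decode3(input, errors="strict"):
    """Decode UTF-8 bytes, returning (data, consumed, had_bom)."""
    had_bom = input[:3] == codecs.BOM_UTF8
    offset = 3 if had_bom else 0
    # codecs.utf_8_decode returns (text, number_of_bytes_consumed).
    data, consumed = codecs.utf_8_decode(input[offset:], errors, True)
    return data, consumed + offset, had_bom


def decode(input, errors="strict"):
    """Conventional two-tuple decoder built on top of decode3."""
    return decode3(input, errors)[:2]


signed = decode3(codecs.BOM_UTF8 + b"hi")
unsigned = decode3(b"hi")
plain = decode(codecs.BOM_UTF8 + b"hi")
print(signed, unsigned, plain)
```

A caller that wants to echo the peer's convention could then choose between utf_8 and utf_8_sig for output based on the `had_bom` flag.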
msg63705 - (view)	Author: Sean Reifschneider (jafo) *	Date: 2008-03-17 18:18
It sounds like the Unicode FAQ has an authoritative statement on this. Is this a "wontfix", or does this need more discussion? Perhaps on python-dev or at the sprints this week?
msg64189 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2008-03-20 18:16
I don't see exactly what James is proposing.

> For my needs, I would like the decoding parts of the utf_8 module
> to treat an initial BOM as an optional signature and skip it if
> there is one (just like the utf_8_sig decoder). In fact I have
> a working patch that replaces the utf_8_sig decode,
> IncrementalDecoder and StreamReader components by direct
> transplants from utf_8_sig (as recently repaired -- there was a
> StreamReader error).

If you want a decoder that behaves like the utf-8-sig decoder, use the utf-8-sig decoder. I don't see how changing the utf-8 decoder helps here.

> I can imagine there might be utf_8 client code out there which
> expects to see a leading U+feff as (perhaps) a clue that the
> output should be returned with a BOM-signature (say) to
> accommodate the guessed input requirements of the remote
> correspondent.

In this case use UTF-8: the leading BOM will be passed to the application.

> I can just live with code like
>     if input[0] == u"\ufeff":
>         input = input[1:]
> spread around, and of course slightly different for incremental
> and stream inputs.

Can you post an example that requires this code?
msg64217 - (view)	Author: James G. sack (jim) (jgsack)	Date: 2008-03-20 22:21
> Can you post an example that requires this code?

This is not a big issue, and it wouldn't hurt if it got declared "go away and come back later if you have patch, test, docs, and a convincing use case".

..But, for the record..

Suppose I want to both read and write some utf8. It is unknown whether the input has a BOM, but it is known to be utf8. I want to write utf8 without any BOM. I see two options, which I find slightly ugly/annoying/error-prone:

a) Use 2 separate encodings: read via utf_8_sig so as to transparently accept input with/without BOM; use utf_8 on output to not emit any BOM.

b) Use utf_8 for read and write and explicitly check for and discard leading BOM on input if any.

What _I_ would prefer is that utf_8 would ignore a BOM, if present (just like utf_8_sig).

(What I was talking about in my last post was a complication in consideration of someone else who would prefer otherwise, or of code that might break upon my change.)

Regards, ..jim
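Options (a) and (b) can be sketched side by side. This assumes Python 3 and in-memory bytes rather than files; the function names are hypothetical. Using `startswith` in option (b) avoids an IndexError on empty input, which indexing `input[0]` would raise.

```python
import codecs


def read_option_a(raw):
    """Option (a): decode with utf-8-sig, which skips an optional signature."""
    return raw.decode("utf-8-sig")


def read_option_b(raw):
    """Option (b): decode with utf-8 and drop a leading U+FEFF by hand."""
    text = raw.decode("utf-8")
    if text.startswith("\ufeff"):
        text = text[1:]
    return text


def write_without_bom(text):
    """Both options write with plain utf-8, so no signature is emitted."""
    return text.encode("utf-8")


samples = [codecs.BOM_UTF8 + b"hi", b"hi", b""]
results_a = [write_without_bom(read_option_a(s)) for s in samples]
results_b = [write_without_bom(read_option_b(s)) for s in samples]
print(results_a, results_b)
```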
msg64324 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2008-03-22 14:42
If you want to use UTF-8-sig for decoding and UTF-8 for encoding and have this available as one codec you can define your own codec for this:

    import codecs

    def search_function(name):
        if name == "myutf8":
            utf8 = codecs.lookup("utf-8")
            utf8_sig = codecs.lookup("utf-8-sig")
            return codecs.CodecInfo(
                name='myutf8',
                encode=utf8.encode,
                decode=utf8_sig.decode,
                incrementalencoder=utf8.IncrementalEncoder,
                incrementaldecoder=utf8_sig.IncrementalDecoder,
                streamreader=utf8_sig.StreamReader,
                streamwriter=utf8.StreamWriter,
            )

    codecs.register(search_function)

Closing the issue as "wont fix"
msg64325 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2008-03-22 14:44
Oops, that code was supposed to read:

    import codecs

    def search_function(name):
        if name == "myutf8":
            utf8 = codecs.lookup("utf-8")
            utf8_sig = codecs.lookup("utf-8-sig")
            return codecs.CodecInfo(
                name='myutf8',
                encode=utf8.encode,
                decode=utf8_sig.decode,
                incrementalencoder=utf8.incrementalencoder,
                incrementaldecoder=utf8_sig.incrementaldecoder,
                streamreader=utf8_sig.streamreader,
                streamwriter=utf8.streamwriter,
            )

    codecs.register(search_function)
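For reference, a usage sketch of the corrected codec, assuming Python 3 (where `bytes.decode` and `str.encode` look up registered codecs by name). The registration is repeated so the example is self-contained; the sample data is illustrative.

```python
import codecs


def search_function(name):
    # Return None for any other name so the lookup continues elsewhere.
    if name != "myutf8":
        return None
    utf8 = codecs.lookup("utf-8")
    utf8_sig = codecs.lookup("utf-8-sig")
    return codecs.CodecInfo(
        name="myutf8",
        encode=utf8.encode,              # writes no signature
        decode=utf8_sig.decode,          # skips an optional signature
        incrementalencoder=utf8.incrementalencoder,
        incrementaldecoder=utf8_sig.incrementaldecoder,
        streamreader=utf8_sig.streamreader,
        streamwriter=utf8.streamwriter,
    )


codecs.register(search_function)

decoded_signed = (codecs.BOM_UTF8 + b"hi").decode("myutf8")
decoded_plain = b"hi".decode("myutf8")
encoded = "hi".encode("myutf8")
print(repr(decoded_signed), repr(decoded_plain), encoded)
```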

History
Date	User	Action	Args
2022-04-11 14:56:27	admin	set	github: 45669
2008-03-22 14:44:52	doerwalter	set	messages: + msg64325
2008-03-22 14:42:01	doerwalter	set	status: open -> closed resolution: wont fix messages: + msg64324
2008-03-20 22:21:41	jgsack	set	messages: + msg64217
2008-03-20 18:16:12	doerwalter	set	messages: + msg64189
2008-03-17 18:18:56	jafo	set	title: feature request: force BOM option -> Force BOM option in UTF output. nosy: + jafo messages: + msg63705 priority: normal assignee: doerwalter type: behavior -> enhancement
2007-11-20 22:23:45	gvanrossum	set	nosy: - gvanrossum
2007-11-20 03:39:57	jgsack	set	messages: + msg57691 versions: + Python 2.6
2007-11-15 13:41:57	doerwalter	set	messages: + msg57529
2007-11-15 12:57:09	doerwalter	set	nosy: + doerwalter messages: + msg57527
2007-11-15 08:40:50	jgsack	set	messages: + msg57522
2007-11-01 22:21:34	Rhamphoryncus	set	messages: + msg57041
2007-11-01 19:56:30	jgsack	set	messages: + msg57033
2007-11-01 19:07:38	Rhamphoryncus	set	nosy: + Rhamphoryncus messages: + msg57028
2007-10-28 01:40:09	ggenellina	set	nosy: + ggenellina
2007-10-26 19:54:12	jgsack	set	messages: + msg56817
2007-10-26 19:36:45	gvanrossum	set	messages: + msg56814
2007-10-26 19:34:47	jgsack	set	messages: + msg56813
2007-10-26 17:44:31	gvanrossum	set	nosy: + gvanrossum messages: + msg56801
2007-10-26 06:43:59	jgsack	set	messages: + msg56782
2007-10-26 06:33:33	jgsack	set	messages: + msg56780
2007-10-25 22:59:36	jgsack	create