
Author jgsack
Recipients Rhamphoryncus, doerwalter, ggenellina, gvanrossum, jgsack
Date 2007-11-20.03:39:55
Message-id <1195529997.85.0.0335207530675.issue1328@psf.upfronthosting.co.za>
Content
More discussion of utf_8.py decoding behavior (and possible change):

For my needs, I would like the decoding parts of the utf_8 module to treat 
an initial BOM as an optional signature and skip it if there is one (just 
like the utf_8_sig decoder). In fact I have a working patch that replaces 
the utf_8 decode, IncrementalDecoder and StreamReader components with 
direct transplants from utf_8_sig (as recently repaired -- there was a 
StreamReader error).
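
For reference, the difference between the two decoders can be shown in a few lines. This is only a sketch, written for Python 3 (bytes in, str out) rather than the Python 2 of this discussion; the sample data is made up.

```python
import codecs

# Made-up input: a UTF-8 BOM signature followed by ordinary text.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# utf_8 keeps the signature as a leading U+FEFF character.
plain = data.decode("utf-8")

# utf_8_sig treats the signature as optional and skips it.
sig = data.decode("utf-8-sig")

print(repr(plain))
print(repr(sig))
```

Input without a BOM decodes identically under both codecs; only the encoding side (and the handling of a leading BOM on decode) differs.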

However the reason for discussion is to ask how it might impact existing 
code.

I can imagine there might be utf_8 client code out there which expects to 
see a leading U+FEFF as (perhaps) a clue that the output should be returned 
with a BOM signature (say) to accommodate the guessed input requirements of 
the remote correspondent.

Making my work easier might actually make someone else's work (probably, 
annoyingly) harder. 

So what to do?

I can just live with code like
  if input[:1] == u"\ufeff":
    input = input[1:]
spread around, and of course slightly different for incremental and stream 
inputs. 
  
  But I probably wouldn't. I would probably substitute a
  "my_utf_8" encoding to make my code a little cleaner.
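
A private "my_utf_8" encoding along those lines might be registered roughly as follows. This is a sketch for Python 3, assuming the goal is utf_8_sig decoding paired with plain utf_8 encoding; the search function name `_search_my_utf_8` is made up for illustration.

```python
import codecs

def _search_my_utf_8(name):
    """Hypothetical codec search function for a private 'my_utf_8' codec."""
    if name != "my_utf_8":
        return None
    utf8 = codecs.lookup("utf-8")
    sig = codecs.lookup("utf-8-sig")
    # Decode like utf_8_sig (skip an optional leading BOM),
    # encode like utf_8 (never write a BOM).
    return codecs.CodecInfo(
        name="my_utf_8",
        encode=utf8.encode,
        decode=sig.decode,
        incrementalencoder=utf8.incrementalencoder,
        incrementaldecoder=sig.incrementaldecoder,
        streamwriter=utf8.streamwriter,
        streamreader=sig.streamreader,
    )

codecs.register(_search_my_utf_8)

text = (codecs.BOM_UTF8 + b"abc").decode("my_utf_8")   # BOM skipped
raw = "abc".encode("my_utf_8")                          # no BOM written
```

This keeps the BOM test out of application code, at the price of every module having to agree on the private codec name.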

Another thought I had would require "the other guy" to update his code, but 
at least it wouldn't make his work annoyingly difficult like my original 
change might have.

Here's the basic outline:

- Add another decoder function that returns a 3-tuple
  decode3(input, errors='strict') => (data, consumed, had_bom)
where had_bom is true if a leading BOM was seen and skipped.

- then the usual decode is just something like
  def decode(input, errors='strict'):
    return decode3(input, errors)[:2]

- add a member variable and accessor to both the IncrementalDecoder and 
StreamReader classes, something like
  def had_bom(self):
    return self._had_bom
and initialize/set the self._had_bom variable as required. (The variable 
needs a different name from the accessor, or it would shadow the method.)
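
The outline above could be fleshed out roughly like this. It is a sketch only, written for Python 3 and not taken from any patch: decode3 and had_bom are the names proposed here, while _first and _had_bom are made-up internal names. Only the incremental decoder is shown; a StreamReader would follow the same pattern.

```python
import codecs

BOM = codecs.BOM_UTF8

def decode3(input, errors="strict"):
    """Proposed helper: return (data, consumed, had_bom)."""
    raw = bytes(input)
    had_bom = raw.startswith(BOM)
    prefix = len(BOM) if had_bom else 0
    text, consumed = codecs.utf_8_decode(raw[prefix:], errors, True)
    # Report the skipped BOM bytes as consumed input.
    return text, consumed + prefix, had_bom

def decode(input, errors="strict"):
    """The usual 2-tuple decode, expressed through decode3."""
    return decode3(input, errors)[:2]

class IncrementalDecoder(codecs.BufferedIncrementalDecoder):
    def __init__(self, errors="strict"):
        super().__init__(errors)
        self._first = True      # still looking at the start of the input
        self._had_bom = False

    def had_bom(self):
        return self._had_bom

    def _buffer_decode(self, input, errors, final):
        prefix = 0
        if self._first:
            if len(input) < len(BOM) and BOM.startswith(input) and not final:
                # Too few bytes to decide whether this is a BOM; wait for more.
                return "", 0
            self._first = False
            if input.startswith(BOM):
                self._had_bom = True
                prefix = len(BOM)
        text, consumed = codecs.utf_8_decode(input[prefix:], errors, final)
        return text, consumed + prefix

    def reset(self):
        super().reset()
        self._first = True
        self._had_bom = False
```

A caller that wants to echo the signature back could then check had_bom() after decoding and prepend the BOM when encoding its reply.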

This complicates the interface somewhat and requires some additional 
documentation.

   Documenting my original simple[-minded] idea would possibly 
   require only a few more words in the existing paragraph
   on utf_8_sig, to mention that both modules have the same 
   decoding behavior but different encoding.

I thought of a secondary consideration: If utf_8 and utf_8_sig are "almost 
the same", it's possible that future refactoring might unify them with 
differences contained in behavior flags (e.g., skip_leading_bom). The leading 
BOM processing might even be pushed into codecs.utf_8_decode for possible 
minor advantages. 
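
To make the flag idea concrete: a single decode function with a skip_leading_bom parameter could serve both codecs. The sketch below (Python 3) is purely illustrative; unified_decode is a made-up name, and codecs.utf_8_decode itself takes no such flag.

```python
import codecs
from functools import partial

def unified_decode(data, errors="strict", *, skip_leading_bom=False):
    """Hypothetical shared decoder; the flag selects the utf_8_sig behavior."""
    raw = bytes(data)
    skipped = 0
    if skip_leading_bom and raw.startswith(codecs.BOM_UTF8):
        skipped = len(codecs.BOM_UTF8)
    text, consumed = codecs.utf_8_decode(raw[skipped:], errors, True)
    return text, consumed + skipped

# The two codecs' decode functions become thin wrappers around one body.
plain_decode = partial(unified_decode, skip_leading_bom=False)
sig_decode = partial(unified_decode, skip_leading_bom=True)
```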

Is there anybody monitoring this who has an opinion on this? 

..jim