Message 57691 - Python tracker


Author	jgsack
Recipients	Rhamphoryncus, doerwalter, ggenellina, gvanrossum, jgsack
Date	2007-11-20.03:39:55
SpamBayes Score	0.013067013
Marked as misclassified	No
Message-id	<1195529997.85.0.0335207530675.issue1328@psf.upfronthosting.co.za>
In-reply-to

Content

More discussion of utf_8.py decoding behavior (and possible change):

For my needs, I would like the decoding parts of the utf_8 module to treat
an initial BOM as an optional signature and skip it if there is one (just
like the utf_8_sig decoder). In fact I have a working patch that replaces
the utf_8 decode, IncrementalDecoder and StreamReader components with
direct transplants from utf_8_sig (as recently repaired -- there was a
StreamReader error).
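
For readers who want to see the difference being discussed, here is a minimal illustration of how the two stock decoders treat a leading BOM. It is written in Python 3 syntax (an assumption; the message itself dates from the Python 2 era) and uses only standard-library calls.

```python
import codecs

# UTF-8 bytes for "abc", preceded by the UTF-8 BOM (EF BB BF).
raw = codecs.BOM_UTF8 + "abc".encode("utf-8")

# The plain utf_8 decoder keeps the BOM as a leading U+FEFF character.
plain = raw.decode("utf_8")

# The utf_8_sig decoder treats the BOM as a signature and drops it.
stripped = raw.decode("utf_8_sig")

print(repr(plain))
print(repr(stripped))
```

The proposal above amounts to making the first call behave like the second.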

However, the reason for this discussion is to ask how it might impact
existing code.

I can imagine there might be utf_8 client code out there which expects to
see a leading U+FEFF as (perhaps) a clue that the output should be returned
with a BOM signature (say) to accommodate the guessed input requirements of
the remote correspondent.

Making my work easier might actually make someone else's work (probably, 
annoyingly) harder. 

So what to do?

I can just live with code like
  if input[0] == u"\ufeff":
    input = input[1:]
spread around, and of course slightly different for incremental and stream
inputs.

  But I probably wouldn't. I would probably substitute a
  "my_utf_8" encoding to make my code a little cleaner.

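One way such a private "my_utf_8" codec could be put together is sketched below, in Python 3 syntax. This is only an illustration under stated assumptions: the search function name `_search_my_utf_8` is hypothetical, and the codec simply pairs the stock utf_8 encoding side with the stock utf_8_sig decoding side rather than reproducing any actual patch.

```python
import codecs
import encodings.utf_8 as utf_8
import encodings.utf_8_sig as utf_8_sig


def _search_my_utf_8(name):
    """Codec search function; "my_utf_8" is a hypothetical codec name."""
    if name != "my_utf_8":
        return None
    return codecs.CodecInfo(
        name="my_utf_8",
        # Encoding side: plain utf_8, so no BOM is ever written.
        encode=utf_8.encode,
        incrementalencoder=utf_8.IncrementalEncoder,
        streamwriter=utf_8.StreamWriter,
        # Decoding side: utf_8_sig, so a leading BOM is skipped.
        decode=utf_8_sig.decode,
        incrementaldecoder=utf_8_sig.IncrementalDecoder,
        streamreader=utf_8_sig.StreamReader,
    )


codecs.register(_search_my_utf_8)

with_bom = codecs.BOM_UTF8 + b"abc"
print(with_bom.decode("my_utf_8"))
print("abc".encode("my_utf_8"))
```

Because the codec is registered under its own name, existing callers of utf_8 are unaffected, which is the point of taking this route instead of changing utf_8 itself.
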
Another thought I had would require "the other guy" to update his code, but 
at least it wouldn't make his work annoyingly difficult like my original 
change might have.

Here's the basic outline:

- Add another decoder function that returns a 3-tuple
  decode3(input, errors='strict') => (data, consumed, had_bom)
where had_bom is true if a leading BOM was seen and skipped

- then the usual decode is just something like
  def decode(input, errors='strict'):
    return decode3(input, errors)[:2]

- add a member variable and accessor to both the IncrementalDecoder and
StreamReader classes, something like
  def had_bom(self):
    return self._had_bom
and initialize/set the self._had_bom variable as required.
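
The outline above could be sketched roughly as follows, in Python 3 syntax. This is an assumption-laden illustration, not the actual patch: `decode3` and `had_bom` come from the outline, while the class name `BomAwareIncrementalDecoder` and the private attributes `_first` and `_had_bom` are hypothetical. Only the incremental decoder is shown; a StreamReader would need the same treatment.

```python
import codecs


def decode3(input, errors="strict"):
    """Return (data, consumed, had_bom); a leading BOM is skipped."""
    data = bytes(input)
    had_bom = data.startswith(codecs.BOM_UTF8)
    prefix = len(codecs.BOM_UTF8) if had_bom else 0
    text, consumed = codecs.utf_8_decode(data[prefix:], errors, True)
    return text, consumed + prefix, had_bom


def decode(input, errors="strict"):
    """The usual 2-tuple decode, expressed in terms of decode3."""
    return decode3(input, errors)[:2]


class BomAwareIncrementalDecoder(codecs.BufferedIncrementalDecoder):
    """Hypothetical incremental decoder that remembers a skipped BOM."""

    def __init__(self, errors="strict"):
        super().__init__(errors)
        self._first = True      # still waiting to examine the first bytes
        self._had_bom = False

    def _buffer_decode(self, input, errors, final):
        if self._first:
            if len(input) < 3 and not final and codecs.BOM_UTF8.startswith(input):
                # Could still turn out to be a BOM; wait for more data.
                return "", 0
            self._first = False
            if input[:3] == codecs.BOM_UTF8:
                self._had_bom = True
                text, consumed = codecs.utf_8_decode(input[3:], errors, final)
                return text, consumed + 3
        return codecs.utf_8_decode(input, errors, final)

    def had_bom(self):
        return self._had_bom

    def reset(self):
        super().reset()
        self._first = True
        self._had_bom = False


print(decode3(codecs.BOM_UTF8 + b"abc"))
```

The accessor reads a differently named attribute (`_had_bom`) because an instance attribute called `had_bom` would shadow a method of the same name.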

This complicates the interface somewhat and requires some additional
documentation.

   To document my original simple [-minded] idea would possibly
   require only a few more words in the existing paragraph
   on utf_8_sig, to mention that both modules have the same
   decoding behavior but different encoding.

I thought of a secondary consideration: If utf_8 and utf_8_sig are "almost
the same", it's possible that future refactoring might unify them, with the
differences contained in behavior flags (e.g., skip_leading_bom). The leading
BOM processing might even be pushed into codecs.utf_8_decode for possible
minor advantages.
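
As a rough sketch of what a flag-driven unification might look like (Python 3 syntax): the function name `utf_8_decode_flagged` and the two wrappers are hypothetical, and nothing here is part of the codecs module. Only the `skip_leading_bom` flag name is taken from the paragraph above.

```python
import codecs


def utf_8_decode_flagged(data, errors="strict", final=True,
                         skip_leading_bom=False):
    """Hypothetical unified decoder; not part of the codecs module."""
    prefix = 0
    if skip_leading_bom and data[:3] == codecs.BOM_UTF8:
        data = data[3:]
        prefix = 3
    text, consumed = codecs.utf_8_decode(data, errors, final)
    return text, consumed + prefix


def decode_utf_8(data, errors="strict"):
    # Current utf_8 behavior: the BOM survives as U+FEFF.
    return utf_8_decode_flagged(data, errors, True, skip_leading_bom=False)


def decode_utf_8_sig(data, errors="strict"):
    # Current utf_8_sig behavior: the BOM is skipped.
    return utf_8_decode_flagged(data, errors, True, skip_leading_bom=True)


sample = codecs.BOM_UTF8 + b"abc"
print(decode_utf_8(sample))
print(decode_utf_8_sig(sample))
```

This covers one-shot decoding only; an incremental or stream version would also have to cope with a BOM split across chunks.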

Is there anybody monitoring this who has an opinion on this? 

..jim

History
Date	User	Action	Args
2007-11-20 03:39:58	jgsack	set	spambayes_score: 0.013067 -> 0.013067013 recipients: + jgsack, gvanrossum, doerwalter, ggenellina, Rhamphoryncus
2007-11-20 03:39:57	jgsack	set	spambayes_score: 0.013067 -> 0.013067 messageid: <1195529997.85.0.0335207530675.issue1328@psf.upfronthosting.co.za>
2007-11-20 03:39:57	jgsack	link	issue1328 messages
2007-11-20 03:39:55	jgsack	create