
Author jgsack
Recipients jgsack
Date 2007-10-26.06:33:32
Message-id <1193380413.29.0.984877318762.issue1328@psf.upfronthosting.co.za>
Content
Feature Request REVISION
========================
Upon reflection and more playing around with some test cases, I wish to 
revise my feature request.

I think the utf8 codecs should accept input with or without the "sig".
On output, only the utf_8_sig should write the 3-byte "sig". This behavior 
change would not seem disruptive to current applications. 
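
A minimal sketch of the utf8 behavior requested above, written for a current Python 3 interpreter (an assumption; the codecs under discussion are those of 2007). The helper name decode_utf8_lenient is hypothetical and is not part of the codecs module.

```python
import codecs

def decode_utf8_lenient(data: bytes) -> str:
    """Hypothetical helper: decode UTF-8, dropping one leading "sig" if present."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf_8")

with_sig = codecs.BOM_UTF8 + "abc".encode("utf_8")
without_sig = "abc".encode("utf_8")

# Input: both forms should decode to the same text.
lenient_with = decode_utf8_lenient(with_sig)
lenient_without = decode_utf8_lenient(without_sig)

# Output: only utf_8_sig writes the 3-byte "sig".
plain_out = "abc".encode("utf_8")
sig_out = "abc".encode("utf_8_sig")
```

The output half already matches what the two codecs do; only the input half of plain utf8 would change.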

For utf16, (arguably) a missing BOM should merely assume machine endianness.
For utf_16_le, utf_16_be input, both should accept & discard a BOM.
On output, I'm not sure; maybe all should write a BOM unless passed a flag 
signifying no BOM? 
Or, to preserve backward compatibility, there could be a parameter write_bom 
defaulting to True for utf16 and False for utf_16_le and utf_16_be. This is a 
modification of the original request (for a force_bom flag).

Unless I have confused myself with my test cases, the current codecs are 
slightly inconsistent for the utf8 codecs:

utf8 treats "sig" as real data, if present, but..
utf_8_sig works right even without the "sig" (so this one I like as is!)
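
For reference, the utf8 difference can be reproduced with a few lines like these. This is a sketch assuming a current Python 3 interpreter, which is not the version the test cases above were run on.

```python
import codecs

with_sig = codecs.BOM_UTF8 + "abc".encode("utf_8")
without_sig = "abc".encode("utf_8")

# utf_8 keeps the "sig" as a leading U+FEFF character.
as_utf8 = with_sig.decode("utf_8")

# utf_8_sig drops the "sig" when present and tolerates its absence.
sig_present = with_sig.decode("utf_8_sig")
sig_absent = without_sig.decode("utf_8_sig")
```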

The 16'ers seem to match the (inferred) specs, but for completeness here:
utf_16 refuses to proceed w/o BOM (even with correct endian input data)
utf_16_le treats BOM as data
utf_16_be treats BOM as data
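
A similar sketch for the 16'ers, under the same assumption of a current Python 3. The refusal without a BOM is shown here through the incremental decoder; a one-shot bytes.decode("utf_16") may behave differently, so treat this as an illustration rather than a full account.

```python
import codecs

le_data = codecs.BOM_UTF16_LE + "hi".encode("utf_16_le")
be_data = codecs.BOM_UTF16_BE + "hi".encode("utf_16_be")

# utf_16_le and utf_16_be keep the BOM as a leading U+FEFF character.
le_text = le_data.decode("utf_16_le")
be_text = be_data.decode("utf_16_be")

# utf_16 uses the BOM to pick the byte order and then drops it.
auto_text = be_data.decode("utf_16")

# Without a BOM, the incremental utf_16 decoder is expected to refuse the input.
decoder = codecs.getincrementaldecoder("utf_16")()
try:
    decoder.decode("hi".encode("utf_16_le"), final=True)
    refused = False
except UnicodeError:
    refused = True
```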

Regards,
..jim