Message56780
Feature Request REVISION
========================
Upon reflection and more playing around with some test cases, I wish to
revise my feature request.
I think the utf8 codecs should accept input with or without the "sig".
On output, only the utf_8_sig should write the 3-byte "sig". This behavior
change would not seem disruptive to current applications.
For utf16, (arguably) with a missing BOM the codec should merely assume machine endianness.
For utf_16_le, utf_16_be input, both should accept & discard a BOM.
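To make the requested input behavior concrete, here is a minimal sketch of a wrapper that accepts and discards a leading BOM or "sig" before decoding. The helper name decode_discarding_bom is hypothetical and is not part of the codecs module; it only models the proposal and does not describe how the codecs behave today.

```python
import codecs
import sys

# BOM / signature bytes that the proposal would accept and discard.
_BOM_FOR = {
    "utf_8": codecs.BOM_UTF8,
    "utf_16_le": codecs.BOM_UTF16_LE,
    "utf_16_be": codecs.BOM_UTF16_BE,
}


def decode_discarding_bom(data, encoding):
    """Hypothetical helper: decode bytes, dropping a leading BOM if present."""
    name = encoding.lower().replace("-", "_")
    if name == "utf_16":
        # A BOM, when present, selects the byte order.
        if data.startswith(codecs.BOM_UTF16_LE):
            return data[2:].decode("utf_16_le")
        if data.startswith(codecs.BOM_UTF16_BE):
            return data[2:].decode("utf_16_be")
        # No BOM: assume machine endianness, as the request suggests.
        native = "utf_16_le" if sys.byteorder == "little" else "utf_16_be"
        return data.decode(native)
    bom = _BOM_FOR.get(name)
    if bom is None:
        raise ValueError("unsupported encoding: %r" % encoding)
    if data.startswith(bom):
        data = data[len(bom):]
    return data.decode(name)


print(repr(decode_discarding_bom(codecs.BOM_UTF8 + b"abc", "utf_8")))
print(repr(decode_discarding_bom(b"\xff\xfea\x00", "utf_16_le")))
```

One consequence of this design is that text which genuinely begins with U+FEFF loses that character on input.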
On output, I'm not sure; maybe all should write a BOM unless passed a flag
signifying no bom?
Or, to preserve backward compat, there could be a parameter write_bom defaulting to
True for utf16 and False for utf_16_le and utf_16_be. This is a
modification of the original request (for a force_bom flag).
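The proposed output defaults could be modeled roughly as below. This is only a sketch: encode_utf16 is a hypothetical function, and write_bom is the parameter being requested, not an existing codec option.

```python
import codecs
import sys


def encode_utf16(text, encoding="utf_16", write_bom=None):
    """Hypothetical model of the requested write_bom parameter."""
    name = encoding.lower().replace("-", "_")
    if name not in ("utf_16", "utf_16_le", "utf_16_be"):
        raise ValueError("unsupported encoding: %r" % encoding)
    if write_bom is None:
        # Proposed defaults: True for utf_16, False for utf_16_le/utf_16_be.
        write_bom = (name == "utf_16")
    if name == "utf_16":
        # The generic codec uses machine byte order.
        name = "utf_16_le" if sys.byteorder == "little" else "utf_16_be"
    bom = codecs.BOM_UTF16_LE if name == "utf_16_le" else codecs.BOM_UTF16_BE
    body = text.encode(name)
    return bom + body if write_bom else body


print(encode_utf16("a"))                             # BOM + native order
print(encode_utf16("a", "utf_16_le"))                # no BOM by default
print(encode_utf16("a", "utf_16_le", write_bom=True))
```

With these defaults, output without an explicit flag matches what the three codecs already write, which is the backward-compatibility point made above.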
Unless I have confused myself with my test cases, the current codecs are
slightly inconsistent for the utf8 codecs:
utf8 treats "sig" as real data, if present, but..
utf_8_sig works right even without the "sig" (so this one I like as is!)
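The utf8 observations can be reproduced with a short test along these lines. It is written for Python 3 bytes/str semantics, which is an assumption on my part since the report predates it; the variable names are illustrative only.

```python
import codecs

data_with_sig = codecs.BOM_UTF8 + "abc".encode("utf-8")
data_plain = "abc".encode("utf-8")

# utf-8 keeps the "sig" as a real character (U+FEFF) in the decoded text.
with_sig_as_utf8 = data_with_sig.decode("utf-8")

# utf-8-sig strips the "sig" when present and tolerates its absence.
with_sig_as_sig = data_with_sig.decode("utf-8-sig")
plain_as_sig = data_plain.decode("utf-8-sig")

# On output, only utf-8-sig writes the 3-byte "sig".
encoded_sig = "abc".encode("utf-8-sig")
encoded_plain = "abc".encode("utf-8")

for label, value in [
    ("utf-8 decode, sig present", with_sig_as_utf8),
    ("utf-8-sig decode, sig present", with_sig_as_sig),
    ("utf-8-sig decode, sig absent", plain_as_sig),
    ("utf-8-sig encode", encoded_sig),
    ("utf-8 encode", encoded_plain),
]:
    print(label, "->", repr(value))
```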
The 16'ers seem to match the (inferred) specs, but for completeness here:
utf_16 refuses to proceed w/o BOM (even with correct endian input data)
utf_16_le treats BOM as data
utf_16_be treats BOM as data
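A similar check for the 16'ers is sketched below, again assuming Python 3 semantics. Whether the generic utf_16 codec rejects BOM-less input may depend on the Python version and on whether the stateless or the stream API is used, so the sketch records that outcome instead of assuming it. The function name read_stream_without_bom is illustrative.

```python
import codecs
import io

le_with_bom = codecs.BOM_UTF16_LE + "a".encode("utf-16-le")
be_with_bom = codecs.BOM_UTF16_BE + "a".encode("utf-16-be")

# The endian-specific codecs keep the BOM as data (a leading U+FEFF).
le_decoded = le_with_bom.decode("utf-16-le")
be_decoded = be_with_bom.decode("utf-16-be")

# The generic codec consumes the BOM and uses it to pick the byte order.
generic_decoded = le_with_bom.decode("utf-16")


def read_stream_without_bom():
    """Read BOM-less little-endian data through the utf-16 stream reader."""
    raw = io.BytesIO("a".encode("utf-16-le"))
    reader = codecs.getreader("utf-16")(raw)
    try:
        return reader.read()
    except UnicodeError as exc:
        # Report the refusal rather than propagating it.
        return exc


print(repr(le_decoded))
print(repr(be_decoded))
print(repr(generic_decoded))
print(repr(read_stream_without_bom()))
```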
Regards,
..jim

Date | User | Action | Args
2007-10-26 06:33:33 | jgsack | set | spambayes_score: 0.0309821 -> 0.030982068 recipients: + jgsack
2007-10-26 06:33:33 | jgsack | set | spambayes_score: 0.0309821 -> 0.0309821 messageid: <1193380413.29.0.984877318762.issue1328@psf.upfronthosting.co.za>
2007-10-26 06:33:33 | jgsack | link | issue1328 messages
2007-10-26 06:33:32 | jgsack | create |