Message 56780 - Python tracker


Author	jgsack
Recipients	jgsack
Date	2007-10-26.06:33:32
Message-id	<1193380413.29.0.984877318762.issue1328@psf.upfronthosting.co.za>
In-reply-to

Content

Feature Request REVISION
========================
Upon reflection and more playing around with some test cases, I wish to 
revise my feature request.

I think the utf8 codecs should accept input with or without the "sig".
On output, only the utf_8_sig should write the 3-byte "sig". This behavior 
change would not seem disruptive to current applications. 

For utf16, (arguably) a missing BOM should merely assume machine endianness.
For utf_16_le, utf_16_be input, both should accept & discard a BOM.
On output, I'm not sure; maybe all should write a BOM unless passed a flag 
signifying no bom? 
Or, to preserve backward compatibility, there could be a parameter write_bom 
defaulting to True for utf16 and False for utf_16_le and utf_16_be. This is a 
modification of the original request (for a force_bom flag).

Unless I have confused myself with my test cases, the current codecs are 
slightly inconsistent for the utf8 codecs:

utf8 treats "sig" as real data, if present, but..
utf_8_sig works right even without the "sig" (so this one I like as is!)
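
The utf8 observations can be reproduced with a few lines like these. This is a sketch in Python 3 syntax (bytes in, str out); the variable names are only for illustration.

```python
import codecs

with_sig = codecs.BOM_UTF8 + "abc".encode("utf_8")   # 3-byte "sig" + data
without_sig = "abc".encode("utf_8")                   # no "sig"

# utf_8 keeps the "sig" as real data: it decodes to a leading U+FEFF.
utf8_with_sig = with_sig.decode("utf_8")

# utf_8_sig strips the "sig" when present and accepts input without it.
sig_with_sig = with_sig.decode("utf_8_sig")
sig_without_sig = without_sig.decode("utf_8_sig")

# On output, only utf_8_sig writes the "sig".
encoded_plain = "abc".encode("utf_8")
encoded_sig = "abc".encode("utf_8_sig")
```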

The 16'ers seem to match the (inferred) specs, but for completeness here:
utf_16 refuses to proceed w/o BOM (even with correct endian input data)
utf_16_le treats BOM as data
utf_16_be treats BOM as data
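
A sketch of how the three utf16 cases can be checked, again in Python 3 syntax. The refusal without a BOM is tested here through the incremental decoder, which is what file reading uses; a one-shot bytes.decode("utf_16") may behave differently (as far as I know it falls back to native byte order), so that case is left out. The function name is hypothetical.

```python
import codecs

le_data = codecs.BOM_UTF16_LE + "a".encode("utf_16_le")
be_data = codecs.BOM_UTF16_BE + "a".encode("utf_16_be")

# utf_16_le / utf_16_be treat the BOM as data: it decodes to U+FEFF.
le_as_data = le_data.decode("utf_16_le")
be_as_data = be_data.decode("utf_16_be")

# utf_16 consumes the BOM and uses it to pick the byte order.
with_bom = le_data.decode("utf_16")


def stream_decode_rejects(data):
    """Return True if the utf_16 incremental decoder refuses the input."""
    decoder = codecs.getincrementaldecoder("utf_16")()
    try:
        decoder.decode(data, final=True)
    except UnicodeError:
        return True
    return False


rejects_missing_bom = stream_decode_rejects("a".encode("utf_16_le"))
rejects_with_bom = stream_decode_rejects(le_data)
```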

Regards,
..jim

History
Date	User	Action	Args
2007-10-26 06:33:33	jgsack	set	spambayes_score: 0.0309821 -> 0.030982068 recipients: + jgsack
2007-10-26 06:33:33	jgsack	set	spambayes_score: 0.0309821 -> 0.0309821 messageid: <1193380413.29.0.984877318762.issue1328@psf.upfronthosting.co.za>
2007-10-26 06:33:33	jgsack	link	issue1328 messages
2007-10-26 06:33:32	jgsack	create