Message 84115 - Python tracker

Author	vstinner
Recipients	christian.heimes, georg.brandl, loewis, tungwaiyip, vstinner
Date	2009-03-24.22:00:02
Message-id	<1237932017.68.0.639132292951.issue1503789@psf.upfronthosting.co.za>
In-reply-to

Content
The attached patch is a partial fix: it supports UTF-16-LE, UTF-16-BE
and UTF-32-LE. Some remarks about my patch:
 * UTF-32-BE is not supported because I'm too lazy tonight
   to finish the patch and because such a file begins with 0x00 0x00,
   whereas the parser doesn't like nul bytes
 * I disabled the cookie check if the file starts with a BOM (the
   cookie is ignored) because the charset name is not normalized,
   so if the cookie is not exactly the same as the hardcoded
   charset name (e.g. "UTF-16LE"), the test will fail.
   E.g. "utf-16le" != "UTF-16LE" :-(
 * compile() would require much more effort to support UTF-16-*
   and UTF-32-* because compile() simply rejects any string with a
   nul byte. That's because it uses functions like strlen() :-/ That's
   why I use subprocess([sys.executable, ...]) in the unit test and
   not simply compile()
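
To illustrate the cookie problem from the second remark: the sketch
below is not the patch (which lives in the C tokenizer), only a
Python illustration of how two spellings of a charset name could be
compared after normalization. It relies on codecs.lookup() from the
standard library; the helper name same_charset() is made up for this
example.

```python
import codecs

def same_charset(cookie_name, bom_name):
    """Return True if both names resolve to the same codec.

    Hypothetical helper: compares the canonical codec names instead of
    the raw strings, so spelling and case differences do not matter.
    """
    try:
        return codecs.lookup(cookie_name).name == codecs.lookup(bom_name).name
    except LookupError:
        # Unknown charset name: treat it as a mismatch.
        return False

# A raw string comparison fails for these two spellings...
print("utf-16le" == "UTF-16LE")
# ...while the normalized comparison treats them as the same codec.
print(same_charset("utf-16le", "UTF-16LE"))
print(same_charset("latin-1", "UTF-16LE"))
```

With a comparison like this, the cookie check would not have to be
disabled when a BOM is present.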

Supporting UTF-{16,32}-{LE,BE} would be nice, but it requires hacking
the parser (especially the compile() builtin function) to support nul
bytes...
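
To show where the nul bytes come from, here is a small Python sketch
(again, not the C patch itself). It encodes a one-line source in each
of the four encodings, prepends the BOM, and reports the first bytes
and the number of nul bytes. The BOM constants come from the codecs
module; detect_bom() and the BOMS table are names invented for this
example.

```python
import codecs

# BOMs for the four encodings discussed. UTF-32 comes first because
# the UTF-32-LE BOM (FF FE 00 00) starts with the UTF-16-LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(data):
    """Return the encoding name matching the BOM at the start of data,
    or None if no known BOM is found (hypothetical helper)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

source = "x = 1\n"
for bom, name in BOMS:
    raw = bom + source.encode(name)
    # Print: encoding, detected encoding, first 4 bytes, nul byte count.
    print(name, detect_bom(raw), raw[:4].hex(), raw.count(b"\x00"))
```

Every ASCII character is padded with nul bytes in these encodings, and
a UTF-32-BE file starts with 0x00 0x00, so any code that treats the
buffer as a C string (strlen() and friends) stops too early.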

History
Date	User	Action	Args
2009-03-24 22:00:17	vstinner	set	recipients: + vstinner, loewis, georg.brandl, tungwaiyip, christian.heimes
2009-03-24 22:00:17	vstinner	set	messageid: <1237932017.68.0.639132292951.issue1503789@psf.upfronthosting.co.za>
2009-03-24 22:00:15	vstinner	link	issue1503789 messages
2009-03-24 22:00:13	vstinner	create