Message 135772 - Python tracker


Author	vstinner
Recipients	belopolsky, ezio.melotti, georg.brandl, lemburg, moese, phr, vstinner
Date	2011-05-11.12:55:37
Message-id	<1305118551.23.0.106349254941.issue2857@psf.upfronthosting.co.za>
In-reply-to

Content

utf_8_java.patch: Implement "utf-8-java" encoding.
 * It has no alias
 * 'a\0b'.encode('utf-8-java') returns b'a\xc0\x80b'
 * b'a\xc0\x80b'.decode('utf-8-java') returns 'a\x00b'
 * I added some tests to utf-8 codec (test_invalid, test_null_byte)
 * I added many tests for utf-8-java codec
 * I chose to copy utf8_code_length as utf8java_code_length instead of adding extra "if" tests, so as not to slow down the UTF-8 codec
 * Decoder: 2-byte sequences may be *a little bit* slower for UTF-8:
"if ((s[1] & 0xc0) != 0x80)"
   is replaced by 
"if ((ch <= 0x007F && (ch != 0x0000 || !java)) || ch > 0x07FF)"
 * Encoder: encoding characters in U+0000-U+007F may be *a little bit* slower for UTF-8: I added a (ch == 0x00 && java) test
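
To make the behaviour listed above concrete, here is a rough pure-Python sketch of the NUL handling only. It is not the patch (which changes the C codec) and it does not rely on a registered "utf-8-java" codec. The function names encode_utf8_java, decode_utf8_java and two_byte_sequence_is_invalid are hypothetical, and letting a raw b'\x00' byte through in the decoder is an assumption, not something the patch description states.

```python
def encode_utf8_java(text: str) -> bytes:
    """Hypothetical helper: UTF-8, except U+0000 becomes the overlong pair C0 80."""
    # In UTF-8 the byte 0x00 only ever encodes U+0000, so a bytes-level
    # replace cannot touch any other character.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_utf8_java(data: bytes) -> str:
    """Hypothetical helper: accept C0 80 as U+0000, otherwise decode as UTF-8."""
    # 0xC0 is never valid in strict UTF-8, so mapping C0 80 back to 00 first
    # leaves every other sequence to the normal decoder.
    # Assumption: a raw 0x00 byte is accepted as well, as in plain UTF-8.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


def two_byte_sequence_is_invalid(ch: int, java: bool) -> bool:
    """Python transcription of the decoder condition quoted in the message.

    ch is the code point decoded from a 2-byte sequence; java selects the
    utf-8-java variant.
    """
    return (ch <= 0x007F and (ch != 0x0000 or not java)) or ch > 0x07FF


if __name__ == "__main__":
    print(encode_utf8_java("a\0b"))
    print(repr(decode_utf8_java(b"a\xc0\x80b")))
```

With java set, the only overlong 2-byte form the condition lets through is the one that decodes to U+0000; every other code point at or below U+007F is still rejected, as in plain UTF-8.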

For the documentation, I just added a "utf-8-java" line to the codec list, but I did not add a paragraph explaining how this codec differs from utf-8. Does anyone have a suggestion?

History
Date	User	Action	Args
2011-05-11 12:55:51	vstinner	set	recipients: + vstinner, lemburg, georg.brandl, phr, belopolsky, moese, ezio.melotti
2011-05-11 12:55:51	vstinner	set	messageid: <1305118551.23.0.106349254941.issue2857@psf.upfronthosting.co.za>
2011-05-11 12:55:39	vstinner	link	issue2857 messages
2011-05-11 12:55:39	vstinner	create