Message 137902 - Python tracker


Author	vstinner
Recipients	amaury.forgeotdarc, loewis, vstinner
Date	2011-06-08.12:47:43
Message-id	<1307537264.97.0.50726648981.issue12281@psf.upfronthosting.co.za>
In-reply-to

Content

mbcs.patch fixes PyUnicode_DecodeMBCS():
 - only use flags=0 if errors="replace" on Windows >= Vista or if errors="ignore" on Windows < Vista
 - support any error handler
 - support any code page (but the code page is hardcoded to CP_ACP)

My patch always tries to decode in strict mode first. On a decode error, it decodes byte by byte and calls unicode_decode_call_errorhandler() for each byte that fails.
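The strategy described above can be sketched in Python. This is only an illustration of the control flow, not the C code from mbcs.patch: the function name decode_with_fallback is hypothetical, Python's built-in codecs stand in for MultiByteToWideChar(), and codecs.lookup_error() stands in for unicode_decode_call_errorhandler().

```python
import codecs


def decode_with_fallback(data: bytes, encoding: str = "cp1252",
                         errors: str = "strict") -> str:
    """Hypothetical sketch: strict decode first, byte-by-byte fallback."""
    # Fast path: decode the whole buffer in strict mode.
    try:
        return data.decode(encoding, "strict")
    except UnicodeDecodeError:
        pass

    # Slow path: decode one byte at a time and hand every failing
    # byte to the requested error handler.
    handler = codecs.lookup_error(errors)
    out = []
    pos = 0
    while pos < len(data):
        try:
            out.append(data[pos:pos + 1].decode(encoding, "strict"))
            pos += 1
        except UnicodeDecodeError:
            exc = UnicodeDecodeError(encoding, data, pos, pos + 1,
                                     "invalid byte")
            # The handler either raises (e.g. "strict") or returns a
            # replacement string and the position to resume from.
            replacement, pos = handler(exc)
            out.append(replacement)
    return "".join(out)
```

Because the slow path feeds the codec a single byte at a time, this sketch is only correct for single-byte code pages, which is the limitation noted in the first TODO item.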

TODO:

 - don't use insize=1 (decode byte by byte): it doesn't work with multibyte encodings (like UTF-8)
 - use final in decode_mbcs_errors(): a multibyte character may be split between two chunks of INT_MAX bytes
 - fix all FIXME
 - also patch PyUnicode_EncodeMBCS()
 - implement the optimizations suggested by Martin?
 - MB_ERR_INVALID_CHARS is not supported by some code pages (e.g. UTF-7 code page)
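The first two TODO items can be illustrated with a small Python sketch. It is not the patch itself: both function names are hypothetical, UTF-8 stands in for a multibyte Windows code page, and Python's incremental decoder plays the role that a final flag would play in decode_mbcs_errors().

```python
import codecs


def decode_byte_by_byte(data: bytes, encoding: str) -> str:
    """Decode with insize=1: each byte is treated as a complete input."""
    return "".join(
        data[i:i + 1].decode(encoding, "replace") for i in range(len(data))
    )


def decode_in_chunks(data: bytes, encoding: str, chunk_size: int) -> str:
    """Decode fixed-size chunks, keeping incomplete sequences pending."""
    decoder = codecs.getincrementaldecoder(encoding)("strict")
    out = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        # Only the last chunk is final; before that, a truncated
        # multibyte sequence is buffered instead of reported as an error.
        final = start + chunk_size >= len(data)
        out.append(decoder.decode(chunk, final=final))
    return "".join(out)


data = "é€".encode("utf-8")  # 2-byte and 3-byte UTF-8 sequences
broken = decode_byte_by_byte(data, "utf-8")
intact = decode_in_chunks(data, "utf-8", 2)
```

Here broken contains only replacement characters, because no single byte of a multibyte sequence is valid on its own, while intact round-trips even though the chunk boundary falls in the middle of the second character.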

Is it necessary to write a NUL character at the end? ("*out = 0;")

It would be nice to support any code page, and maybe support more options (e.g. MB_COMPOSITE, MB_PRECOMPOSED, MB_USEGLYPHCHARS to decode).

It is possible to test different code pages by changing the hardcoded code_page value in PyUnicode_DecodeMBCS. Change your region in the control panel if you would like to change the Windows ANSI code page. You can also play with SetThreadLocale() and CP_THREAD_ACP to test the ANSI code page of the current thread.
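Without rebuilding Python or changing the system region, a rough idea of how the same bytes behave under different ANSI code pages can be had from Python's own cpXXXX codecs. This is only an approximation under the assumption that those codecs match the Windows code pages closely enough for a quick check; it does not exercise PyUnicode_DecodeMBCS() or MultiByteToWideChar() at all, and the helper name is hypothetical.

```python
def decode_under_code_pages(data: bytes, code_pages):
    """Decode the same bytes with several cpXXXX codecs for comparison."""
    results = {}
    for name in code_pages:
        try:
            results[name] = data.decode(name, "strict")
        except UnicodeDecodeError:
            # None marks a byte sequence that is invalid in this code page.
            results[name] = None
    return results


# Byte 0xE9 maps to a different character in each of these code pages.
results = decode_under_code_pages(b"\xe9", ["cp1252", "cp1251", "cp1253"])
```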

History
Date	User	Action	Args
2011-06-08 12:47:45	vstinner	set	recipients: + vstinner, loewis, amaury.forgeotdarc
2011-06-08 12:47:44	vstinner	set	messageid: <1307537264.97.0.50726648981.issue12281@psf.upfronthosting.co.za>
2011-06-08 12:47:44	vstinner	link	issue12281 messages
2011-06-08 12:47:44	vstinner	create