Message 116001 - Python tracker



Author	vstinner
Recipients	loewis, vstinner
Date	2010-09-10.11:04:38
Message-id	<1284116681.52.0.20336510367.issue9821@psf.upfronthosting.co.za>
In-reply-to

Content
It would be nice to support PEP 383 (surrogateescape) on Windows, but the mbcs codec doesn't support it, for performance reasons. The Windows functions used to encode/decode MBCS don't report the index of the unencodable/undecodable character/byte. For encoding, we can try to encode character by character (being careful with surrogate pairs) and check whether each character is a Python lone surrogate (a character in the range U+DC80..U+DCFF). Decoding is more complex because MBCS can be a multibyte encoding, e.g. cp65001 (the Microsoft variant of UTF-8, see #6058). So it's not possible to decode byte by byte, and we would have to write a heuristic to guess the right number of bytes for each call to the decode function.
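
The character-by-character encoding idea can be sketched roughly as follows. This is only an illustration in Python 3, not the mbcs codec's implementation: the helper name `encode_char_by_char` is hypothetical, and "cp1252" stands in for the real MBCS code page so that the sketch runs on any platform. In Python 3.3 and later a str is iterated by code point, so the surrogate-pair caveat (which concerns UTF-16 storage) does not arise here.

```python
def encode_char_by_char(text, encoding="cp1252"):
    """Encode text one character at a time (hypothetical helper).

    Lone surrogates in U+DC80..U+DCFF are mapped back to the raw bytes
    0x80..0xFF, as PEP 383 (surrogateescape) specifies. Every other
    character is encoded strictly, so an unencodable character raises
    UnicodeEncodeError.
    """
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Escaped byte: recover the original value 0x80..0xFF.
            out.append(code - 0xDC00)
        else:
            out += ch.encode(encoding)
    return bytes(out)


escaped = encode_char_by_char("abc\udc81def")
plain = encode_char_by_char("caf\u00e9")
```

Encoding each character separately is simple but slow, which matches the performance concern raised above.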

--

A completely different solution is to get the MBCS code page and use the corresponding Python code page codec (e.g. "cp1252") instead of the "mbcs" encoding, because the Python cpXXXX codecs support all Python error handlers. Example (with Python 2.6):

>>> print(u"abcŁdef".encode("cp1252", "replace"))
abc?def
>>> print(u"abcŁdef".encode("cp1252", "ignore"))
abcdef
>>> print(u"abcŁdef".encode("cp1252", "backslashreplace"))
abc\u0141def
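
As a further illustration, here is a minimal Python 3 sketch of the same approach using the surrogateescape handler itself. It assumes the active code page is 1252. The name `CODE_PAGE` is illustrative only, and on a real Windows system the code page would have to be queried rather than hard-coded.

```python
# Assumption: the MBCS (ANSI) code page is 1252. Hard-coded here so the
# sketch runs on any platform.
CODE_PAGE = "cp1252"

# Byte 0x81 is undefined in Python's cp1252 codec, so a strict decode
# would fail. surrogateescape maps it to the lone surrogate U+DC81.
raw = b"abc\x81def"
text = raw.decode(CODE_PAGE, "surrogateescape")

# Encoding with the same handler restores the original byte.
restored = text.encode(CODE_PAGE, "surrogateescape")

# The other error handlers work as in the Python 2.6 session above.
replaced = "abc\u0141def".encode(CODE_PAGE, "replace")
```

Because the round trip is lossless, undecodable bytes can pass through str values unchanged, which is the point of PEP 383.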

See also #8611 for the problem that occurs when the Python path cannot be encoded to mbcs (work in progress, see #9425).

History
Date	User	Action	Args
2010-09-10 11:04:41	vstinner	set	recipients: + vstinner, loewis
2010-09-10 11:04:41	vstinner	set	messageid: <1284116681.52.0.20336510367.issue9821@psf.upfronthosting.co.za>
2010-09-10 11:04:40	vstinner	link	issue9821 messages
2010-09-10 11:04:38	vstinner	create