
classification
Title: Support PEP 383 on Windows: mbcs support of surrogateescape error handler
Type: Stage:
Components: Interpreter Core, Library (Lib), Unicode, Windows Versions: Python 3.2
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: lemburg, loewis, vstinner
Priority: normal Keywords:

Created on 2010-09-10 11:04 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (4)
msg116001 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-10 11:04
It would be nice to support PEP 383 (surrogateescape) on Windows, but the mbcs codec doesn't support it, for performance reasons. The Windows functions that encode/decode MBCS don't report the index of the unencodable/undecodable character/byte. For encoding, we can try to encode character by character (being careful with surrogate pairs) and check whether each character is a Python lone surrogate (a character in the range U+DC80..U+DCFF). Decoding is more complex because MBCS can be a multibyte encoding, e.g. cp65001 (the Microsoft variant of UTF-8, see #6058). So it's not possible to decode byte by byte, and we would have to write a heuristic to guess the right number of bytes for each call to the decode function.
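
The character-by-character encoding idea described above can be sketched in pure Python 3. This is only an illustration, not the mbcs codec's implementation: the helper name `encode_char_by_char` is hypothetical, and "cp1252" stands in for the Windows code page. Python 3 strings iterate by code point, so the surrogate-pair caveat (which matters for a UTF-16 buffer at the C level) does not arise here.

```python
def encode_char_by_char(text, encoding):
    """Encode text one character at a time (hypothetical helper).

    Lone surrogates in U+DC80..U+DCFF are mapped back to the raw
    bytes 0x80..0xFF, as PEP 383 specifies; every other character
    is encoded strictly with the given codec.
    """
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if 0xDC80 <= code_point <= 0xDCFF:
            # Escaped byte: recover the original byte value.
            out.append(code_point - 0xDC00)
        else:
            # Regular character: raises UnicodeEncodeError if unencodable.
            out += ch.encode(encoding)
    return bytes(out)


escaped = "abc\udcffdef"
manual = encode_char_by_char(escaped, "cp1252")
builtin = escaped.encode("cp1252", "surrogateescape")
print(manual, builtin)
```

Both calls are expected to produce the same bytes, since the Python cpXXXX codecs already accept the surrogateescape handler when encoding.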

--

A completely different solution is to get the MBCS code page and use the matching Python code page codec (e.g. "cp1252") instead of the "mbcs" encoding, because the Python cpXXXX codecs support all Python error handlers. Example (with Python 2.6):

>>> print(u"abcŁdef".encode("cp1252", "replace"))
abc?def
>>> print(u"abcŁdef".encode("cp1252", "ignore"))
abcdef
>>> print(u"abcŁdef".encode("cp1252", "backslashreplace"))
abc\u0141def
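
A minimal Python 3 sketch of looking up the code page and picking the matching codec follows. It assumes the ANSI code page can be read with the Win32 `GetACP()` function through `ctypes`; the helper name `ansi_code_page_codec` and the "cp1252" fallback (used on non-Windows platforms, or when Python has no codec for the reported code page) are assumptions made for illustration.

```python
import codecs
import sys


def ansi_code_page_codec(default="cp1252"):
    """Return the name of a Python codec for the ANSI code page.

    Hypothetical helper: on Windows it asks GetACP() for the code
    page number; elsewhere it returns the assumed default.
    """
    if sys.platform == "win32":
        import ctypes
        name = "cp%d" % ctypes.windll.kernel32.GetACP()
        try:
            codecs.lookup(name)
            return name
        except LookupError:
            # No Python codec for this code page: use the fallback.
            return default
    return default


codec = ansi_code_page_codec()
# Unlike "mbcs" at the time of this report, the cpXXXX codecs accept
# the standard error handlers.
print(codec, "abc\u0141def".encode(codec, "backslashreplace"))
```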

See also #8611 for the problem if the Python path cannot be encoded to mbcs (work in progress, see #9425).
msg116006 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-09-10 11:14
STINNER Victor wrote:
> A completely different solution is to get the MBCS code page and use the matching Python code page codec (e.g. "cp1252") instead of the "mbcs" encoding, because the Python cpXXXX codecs support all Python error handlers. Example (with Python 2.6):
> 
> >>> print(u"abcŁdef".encode("cp1252", "replace"))
> abc?def
> >>> print(u"abcŁdef".encode("cp1252", "ignore"))
> abcdef
> >>> print(u"abcŁdef".encode("cp1252", "backslashreplace"))
> abc\u0141def

That would certainly be a better approach, provided that our
cp-encodings are indeed compatible with the Windows variants
(which unfortunately often use slightly different mappings).

We could then also alias 'mbcs' to the cp-encoding (sort of
like the reverse of what we do in site.py:aliasmbcs()).
msg116011 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-10 12:16
Oh wait. PEP 383 is a solution for storing undecodable bytes in a Unicode string, but for mbcs I'm trying to do the opposite: store unencodable Unicode characters in bytes, and this is not possible (at least not with PEP 383).

Example with Python 3.1:

>>> print("abcŁdef".encode("cp1252", "surrogateescape"))
...
UnicodeEncodeError: 'charmap' codec can't encode character '\u0141' in position 3: character maps to <undefined>
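
The asymmetry can be seen in a short Python 3 sketch. The "ascii" and "cp1252" codecs are chosen here only for illustration: surrogateescape round-trips undecodable bytes through lone surrogates, but it offers nothing for a real character that the target encoding lacks.

```python
# Decoding direction: the undecodable byte 0xFF becomes the lone
# surrogate U+DCFF, and encoding restores the original byte.
raw = b"abc\xffdef"
text = raw.decode("ascii", "surrogateescape")
round_trip = text.encode("ascii", "surrogateescape")
print(ascii(text), round_trip == raw)

# Encoding direction: U+0141 is a real character, not an escaped
# byte, so the handler cannot help and the codec raises an error.
try:
    "abc\u0141def".encode("cp1252", "surrogateescape")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
print(encode_failed)
```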
msg116044 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-10 21:36
Closing this issue: PEP 383 is specific to filesystems that use bytes, so it is useless on Windows (the problem on Windows is with encoding, not decoding).

History
Date                 User      Action  Args
2022-04-11 14:57:06  admin     set     github: 54030
2010-09-10 21:36:45  vstinner  set     status: open -> closed
                                       resolution: not a bug
                                       messages: + msg116044
2010-09-10 12:16:20  vstinner  set     messages: + msg116011
2010-09-10 11:14:59  lemburg   set     nosy: + lemburg
                                       title: Support PEP 383 on Windows: mbcs support of surrogateescape error handler -> Support PEP 383 on Windows: mbcs support of surrogateescape error handler
                                       messages: + msg116006
2010-09-10 11:04:40  vstinner  create