classification
Title: encodings: the "mbcs" alias doesn't work
Type: Stage: resolved
Components: Library (Lib) Versions: Python 3.11
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: eryksun, vstinner
Priority: normal Keywords: patch

Created on 2022-02-06 23:06 by vstinner, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 31174 closed vstinner, 2022-02-06 23:17
Messages (8)
msg412678 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-06 23:06
While working on bpo-46659, I found a bug in the encodings "mbcs" alias. Even though the function has 2 tests (in test_codecs and test_site), both tests missed the bug :-(

I fixed the alias with this change:
---
commit 04dd60e50cd3da48fd19cdab4c0e4cc600d6af30
Author: Victor Stinner <vstinner@python.org>
Date:   Sun Feb 6 21:50:09 2022 +0100

    bpo-46659: Update the test on the mbcs codec alias (GH-31168)
    
    encodings registers the _alias_mbcs() codec search function before
    the search_function() codec search function. Previously, the
    _alias_mbcs() was never used.
    
    Fix the test_codecs.test_mbcs_alias() test: use the current ANSI code
    page, not a fake ANSI code page number.
    
    Remove the test_site.test_aliasing_mbcs() test: the alias is now
    implemented in the encodings module, no longer in the site module.
---

But Eryk found two bugs:

"""
This was never true before. With 1252 as my ANSI code page, I checked codecs.lookup('cp1252') in 2.7, 3.4, 3.5, 3.6, 3.9, and 3.10, and none of them return the "mbcs" encoding. It's not equivalent, and not supposed to be. The implementation of "cp1252" should be cross-platform, regardless of whether we're on a Windows system with 1252 as the ANSI code page, as opposed to a Windows system with some other ANSI code page, or a Linux or macOS system.

The differences are that "mbcs" maps every byte, whereas our code-page encodings do not map undefined bytes, and the "replace" handler of "mbcs" uses a best-fit mapping (e.g. "α" -> "a") when encoding text, instead of mapping all undefined characters to "?".
"""

and my new test fails if the PYTHONUTF8=1 environment variable is set:

"""
This will fail if PYTHONUTF8 is set in the environment, because it overrides getpreferredencoding(False) and _get_locale_encoding().
"""
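
The codec differences Eryk describes in the first quote can be checked with a small sketch. It only exercises the cross-platform "cp1252" codec unconditionally; the "mbcs" calls are guarded because that codec exists only on Windows, and their results depend on the ANSI code page of the machine. The helper names are made up for this illustration.

```python
import sys


def decode_undefined_byte(encoding):
    """Decode byte 0x81, which code page 1252 leaves undefined.

    Returns the decoded string, or None if the codec rejects the byte.
    """
    try:
        return b"\x81".decode(encoding)
    except UnicodeDecodeError:
        return None


def encode_with_replace(text, encoding):
    """Encode text using the "replace" error handler."""
    return text.encode(encoding, "replace")


# Python's own cp1252 codec does not map 0x81 and replaces
# unencodable characters with "?".
print(decode_undefined_byte("cp1252"))
print(encode_with_replace("\u03b1", "cp1252"))  # Greek small letter alpha

if sys.platform == "win32":
    # Windows only: "mbcs" delegates to the ANSI code page of the system,
    # so the results may differ from the lines printed above.
    print(decode_undefined_byte("mbcs"))
    print(encode_with_replace("\u03b1", "mbcs"))
```

On a Windows system whose ANSI code page is 1252, the last two lines are where the "maps every byte" and best-fit behaviour described in the quote would be expected to show up.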

The code for the "mbcs" alias changed a lot between Python 3.5 and 3.7.

In Python 3.5, site module:
---
def aliasmbcs():
    """On Windows, some default encodings are not provided by Python,
    while they are always available as "mbcs" in each locale. Make
    them usable by aliasing to "mbcs" in such a case."""
    if sys.platform == 'win32':
        import _bootlocale, codecs                        
        enc = _bootlocale.getpreferredencoding(False)
        if enc.startswith('cp'):            # "cp***" ?
            try:
                codecs.lookup(enc)
            except LookupError:
                import encodings
                encodings._cache[enc] = encodings._unknown
                encodings.aliases.aliases[enc] = 'mbcs'
---

In Python 3.6, encodings module:
---
(...)
codecs.register(search_function)

if sys.platform == 'win32':
    def _alias_mbcs(encoding):
        try:
            import _bootlocale
            if encoding == _bootlocale.getpreferredencoding(False):
                import encodings.mbcs
                return encodings.mbcs.getregentry()
        except ImportError:
            # Imports may fail while we are shutting down
            pass

    codecs.register(_alias_mbcs)
---

Python 3.7, encodings module:
---
(...)
codecs.register(search_function)

if sys.platform == 'win32':
    def _alias_mbcs(encoding):
        try:
            import _winapi
            ansi_code_page = "cp%s" % _winapi.GetACP()
            if encoding == ansi_code_page:
                import encodings.mbcs
                return encodings.mbcs.getregentry()
        except ImportError:
            # Imports may fail while we are shutting down
            pass

    codecs.register(_alias_mbcs)
---

The Python 3.6 and 3.7 "codecs.register(_alias_mbcs)" doesn't work because "search_function()" is tested before and it works for "cpXXX" encodings. My change modifies the order in which codec search functions are registered: first the MBCS alias, then the encodings search_function().

In Python 3.5, the alias was only created if Python didn't support the code page.
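
The effect of registration order described above can be reproduced with a minimal sketch. It does not touch the real "mbcs" alias: the codec name "demoaliasorder" and both search functions are invented for the illustration, and existing codecs stand in for the results.

```python
import codecs

calls = []  # records which search functions were consulted


def first_search(name):
    """Registered first, so it is asked before second_search."""
    calls.append(("first", name))
    if name == "demoaliasorder":  # hypothetical codec name
        return codecs.lookup("utf-8")
    return None


def second_search(name):
    """Registered second; only reached if earlier functions return None."""
    calls.append(("second", name))
    if name == "demoaliasorder":
        return codecs.lookup("latin-1")
    return None


codecs.register(first_search)
codecs.register(second_search)

info = codecs.lookup("demoaliasorder")
print(info.name)
print(calls)
```

Search functions are consulted in registration order and the first one that returns a codec wins, so the second function is never asked about this name. The same mechanism explains why an alias registered after search_function() is only reached for names that search_function() cannot resolve.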
msg412680 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-06 23:10
The alias was created in 2003 to fix bpo-671666:
---
commit 4eab486476c0082087a8460a5ab1064e64cc1a6b
Author: Martin v. Löwis <martin@v.loewis.de>
Date:   Mon Mar 3 09:34:01 2003 +0000

    Patch #671666: Alias ANSI code page to "mbcs".
---

In 2003, bpo-671666 was created because Python didn't support the "cp932" encoding, whereas the MBCS codec was available and could be used directly since cp932 was the ANSI code page.

The alias made it possible to support ANSI code page 932 without implementing it.

But Python got a "cp932" codec the year after:
---
commit 3e2a30692085d32ac63f72b35da39158a471fc68
Author: Hye-Shik Chang <hyeshik@gmail.com>
Date:   Sat Jan 17 14:29:29 2004 +0000

    Add CJK codecs support as discussed on python-dev. (SF #873597)
    
    Several style fixes are suggested by Martin v. Loewis and
    Marc-Andre Lemburg. Thanks!
---
msg412683 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-06 23:13
Python 3.11 supports the following 40 code pages:

* 037
* 273
* 424
* 437
* 500
* 720
* 737
* 775
* 850
* 852
* 855
* 856
* 857
* 858
* 860
* 861
* 862
* 863
* 864
* 865
* 866
* 869
* 874
* 875
* 932
* 949
* 950
* 1006
* 1026
* 1125
* 1140
* 1250
* 1251
* 1252
* 1253
* 1254
* 1255
* 1256
* 1257
* 1258
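
A list like the one above can be checked against any interpreter with codecs.lookup(). The following is a small sketch, with the code page numbers copied from the list and a helper name chosen for this example.

```python
import codecs

# Code page numbers copied from the list above.
CODE_PAGES = [
    "037", "273", "424", "437", "500", "720", "737", "775", "850", "852",
    "855", "856", "857", "858", "860", "861", "862", "863", "864", "865",
    "866", "869", "874", "875", "932", "949", "950", "1006", "1026", "1125",
    "1140", "1250", "1251", "1252", "1253", "1254", "1255", "1256", "1257",
    "1258",
]


def split_by_support(code_pages):
    """Return (available, missing) lists of "cpNNN" codec names."""
    available, missing = [], []
    for cp in code_pages:
        name = f"cp{cp}"
        try:
            codecs.lookup(name)
        except LookupError:
            missing.append(name)
        else:
            available.append(name)
    return available, missing


available, missing = split_by_support(CODE_PAGES)
print(len(available), "available;", "missing:", missing)
```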
msg412691 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2022-02-07 00:10
> The Python 3.6 and 3.7 "codecs.register(_alias_mbcs)" doesn't work 
> because "search_function()" is tested before and it works for "cpXXX" 
> encodings.

Isn't the 3.6-3.10 ordering of search_function() and _alias_mbcs() correct as a fallback? In this case, Python doesn't support a cross-platform encoding for the code page. That's why the old implementation of test_mbcs_alias() mocked _winapi.GetACP() to return 123 and then checked that looking up 'cp123' returned the "mbcs" codec.

I'd actually prefer to extend this by implementing _winapi.GetOEMCP() and using "oem" as a fallback for that case.
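
The fallback behaviour discussed here can be sketched without Windows. The example below assumes a made-up code page name, "cp99999", that no Python codec implements, and uses "latin-1" purely as a stand-in for "mbcs"; it is not the encodings module's actual code.

```python
import codecs

FAKE_ANSI_CODE_PAGE = "cp99999"  # hypothetical: no such codec exists


def fallback_alias(name):
    """Fallback search function, registered after the built-in one."""
    if name == FAKE_ANSI_CODE_PAGE:
        # Stand-in for encodings.mbcs.getregentry() on Windows.
        return codecs.lookup("latin-1")
    return None


codecs.register(fallback_alias)

# An implemented code page is still resolved by the regular search function.
print(codecs.lookup("cp1252").name)
# The unimplemented one falls through to the fallback alias.
print(codecs.lookup(FAKE_ANSI_CODE_PAGE).name)
```

Because the fallback is registered last, it is never consulted for code pages that already have a cross-platform codec, which is the property the old test checked by mocking the ANSI code page as 123.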
msg412738 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-07 13:00
I don't think that this fallback is needed anymore. Which Windows code page can be used as ANSI code page which is not already implemented as a Python codec?
msg412777 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2022-02-07 17:53
> I don't think that this fallback is needed anymore. Which Windows
> code page can be used as ANSI code page which is not already 
> implemented as a Python codec?

Python has full coverage of the ANSI and OEM code pages in the standard Windows locales, but I don't have any experience with custom (i.e. supplemental or replacement) locales.

https://docs.microsoft.com/en-us/windows/win32/intl/custom-locales 

Here's a simple script to check the standard locales.

    import codecs
    import ctypes
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

    LOCALE_ALL = 0
    LOCALE_WINDOWS = 1
    LOCALE_IDEFAULTANSICODEPAGE = 0x1004
    LOCALE_IDEFAULTCODEPAGE = 0x000B # OEM

    EnumSystemLocalesEx = kernel32.EnumSystemLocalesEx
    GetLocaleInfoEx = kernel32.GetLocaleInfoEx
    GetCPInfoExW = kernel32.GetCPInfoExW

    EnumLocalesProcEx = ctypes.WINFUNCTYPE(ctypes.c_int,
        ctypes.c_wchar_p, ctypes.c_ulong, ctypes.c_void_p)

    class CPINFOEXW(ctypes.Structure):
        _fields_ = (('MaxCharSize', ctypes.c_uint),
                    ('DefaultChar', ctypes.c_ubyte * 2),
                    ('LeadByte', ctypes.c_ubyte * 12),
                    ('UnicodeDefaultChar', ctypes.c_wchar),
                    ('CodePage', ctypes.c_uint),
                    ('CodePageName', ctypes.c_wchar * 260))

    def get_all_locale_code_pages():
        result = []
        seen = set()
        info = (ctypes.c_wchar * 100)()

        @EnumLocalesProcEx
        def callback(locale, flags, param):
            for lctype in (LOCALE_IDEFAULTANSICODEPAGE, LOCALE_IDEFAULTCODEPAGE):
                if (GetLocaleInfoEx(locale, lctype, info, len(info)) and
                      info.value not in ('0', '1')):
                    cp = int(info.value)
                    if cp in seen:
                        continue
                    seen.add(cp)
                    cp_info = CPINFOEXW()
                    if not GetCPInfoExW(cp, 0, ctypes.byref(cp_info)):
                        cp_info.CodePage = cp
                        cp_info.CodePageName = str(cp)
                    result.append(cp_info)
            return True

        if not EnumSystemLocalesEx(callback, LOCALE_WINDOWS, None, None):
            raise ctypes.WinError(ctypes.get_last_error())

        result.sort(key=lambda x: x.CodePage)
        return result

    supported = []
    unsupported = []
    for cp_info in get_all_locale_code_pages():
        cp = cp_info.CodePage
        try:
            codecs.lookup(f'cp{cp}')
        except LookupError:
            unsupported.append(cp_info)
        else:
            supported.append(cp_info)

    if unsupported:
        print('Unsupported:\n')
        for cp_info in unsupported:
            print(cp_info.CodePageName)
        print('\nSupported:\n')
    else:
        print('All Supported:\n')
    for cp_info in supported:
        print(cp_info.CodePageName)


Output:

    All Supported:

    437   (OEM - United States)
    720   (Arabic - Transparent ASMO)
    737   (OEM - Greek 437G)
    775   (OEM - Baltic)
    850   (OEM - Multilingual Latin I)
    852   (OEM - Latin II)
    855   (OEM - Cyrillic)
    857   (OEM - Turkish)
    862   (OEM - Hebrew)
    866   (OEM - Russian)
    874   (ANSI/OEM - Thai)
    932   (ANSI/OEM - Japanese Shift-JIS)
    936   (ANSI/OEM - Simplified Chinese GBK)
    949   (ANSI/OEM - Korean)
    950   (ANSI/OEM - Traditional Chinese Big5)
    1250  (ANSI - Central Europe)
    1251  (ANSI - Cyrillic)
    1252  (ANSI - Latin I)
    1253  (ANSI - Greek)
    1254  (ANSI - Turkish)
    1255  (ANSI - Hebrew)
    1256  (ANSI - Arabic)
    1257  (ANSI - Baltic)
    1258  (ANSI/OEM - Viet Nam)

Some locales are Unicode only (e.g. Hindi-India) or have no OEM code page, which the above code skips by checking for "0" or "1" as the code page value. Windows 10+ allows setting the system locale to a Unicode-only locale, for which it uses UTF-8 (65001) for ANSI and OEM.

The OEM code page matters because the console input and output code pages default to OEM, e.g. for os.device_encoding(). The console's I/O code pages are used in Python by low-level os.read() and os.write(). Note that the console doesn't properly implement using UTF-8 (65001) as the input code page. In this case, input read from the console via ReadFile() or ReadConsoleA() has a null byte in place of each non-ASCII character.
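
To see what os.device_encoding() reports for the standard streams, a short sketch such as the following can be used. The helper name is an assumption of this example, and the values printed depend entirely on the terminal or console the script runs in.

```python
import os
import sys


def console_encoding(fd):
    """Return the encoding of the terminal attached to fd, or None.

    os.device_encoding() returns None when fd is not connected to a
    terminal or console (for example a pipe or a regular file).
    """
    return os.device_encoding(fd)


for label, stream in (("stdin", sys.stdin),
                      ("stdout", sys.stdout),
                      ("stderr", sys.stderr)):
    try:
        fd = stream.fileno()
    except (AttributeError, OSError, ValueError):
        # The stream may be None or may not wrap a real file descriptor.
        continue
    print(label, console_encoding(fd))
```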
msg412847 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-08 17:09
I created GH-31218, which basically restores the Python 3.10 code but enhances the test.
msg413825 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-23 17:14
commit ccbe8045faf6e63d36229ea4e1b9298572cda126
Author: Victor Stinner <vstinner@python.org>
Date:   Tue Feb 22 22:04:07 2022 +0100

    bpo-46659: Fix the MBCS codec alias on Windows (GH-31218)
History
Date User Action Args
2022-04-11 14:59:55 admin set github: 90826
2022-02-23 17:14:28 vstinner set status: open -> closed; resolution: fixed; messages: + msg413825; stage: patch review -> resolved
2022-02-08 17:09:32 vstinner set messages: + msg412847
2022-02-07 17:53:46 eryksun set messages: + msg412777
2022-02-07 13:00:07 vstinner set messages: + msg412738
2022-02-07 00:10:02 eryksun set nosy: + eryksun; messages: + msg412691
2022-02-06 23:17:16 vstinner set keywords: + patch; stage: patch review; pull_requests: + pull_request29345
2022-02-06 23:13:19 vstinner set messages: + msg412683
2022-02-06 23:10:38 vstinner set messages: + msg412680
2022-02-06 23:06:49 vstinner create