Issue46668
Created on 2022-02-06 23:06 by vstinner, last changed 2022-04-11 14:59 by admin. This issue is now closed.
Pull Requests | |||
---|---|---|---|
URL | Status | Linked | Edit |
PR 31174 | closed | vstinner, 2022-02-06 23:17 |
Messages (8) | |||
---|---|---|---|
msg412678 - (view) | Author: STINNER Victor (vstinner) * | Date: 2022-02-06 23:06 | |
While working on bpo-46659, I found a bug in the encodings "mbcs" alias. Even if the function has 2 tests (in test_codecs and test_site), both tests missed the bug :-(

I fixed the alias with this change:
---
commit 04dd60e50cd3da48fd19cdab4c0e4cc600d6af30
Author: Victor Stinner <vstinner@python.org>
Date:   Sun Feb 6 21:50:09 2022 +0100

    bpo-46659: Update the test on the mbcs codec alias (GH-31168)

    encodings registers the _alias_mbcs() codec search function before
    the search_function() codec search function. Previously, the
    _alias_mbcs() was never used.

    Fix the test_codecs.test_mbcs_alias() test: use the current ANSI
    code page, not a fake ANSI code page number.

    Remove the test_site.test_aliasing_mbcs() test: the alias is now
    implemented in the encodings module, no longer in the site module.
---

But Eryk found two bugs:

"""
This was never true before. With 1252 as my ANSI code page, I checked codecs.lookup('cp1252') in 2.7, 3.4, 3.5, 3.6, 3.9, and 3.10, and none of them return the "mbcs" encoding. It's not equivalent, and not supposed to be. The implementation of "cp1252" should be cross-platform, regardless of whether we're on a Windows system with 1252 as the ANSI code page, as opposed to a Windows system with some other ANSI code page, or a Linux or macOS system. The differences are that "mbcs" maps every byte, whereas our code-page encodings do not map undefined bytes, and the "replace" handler of "mbcs" uses a best-fit mapping (e.g. "α" -> "a") when encoding text, instead of mapping all undefined characters to "?".
"""

and my new test fails if PYTHONUTF8=1 env var is set:

"""
This will fail if PYTHONUTF8 is set in the environment, because it overrides getpreferredencoding(False) and _get_locale_encoding().
"""

The code for the "mbcs" alias changed a lot between Python 3.5 and 3.7.

In Python 3.5, site module:
---
def aliasmbcs():
    """On Windows, some default encodings are not provided by Python,
    while they are always available as "mbcs" in each locale. Make
    them usable by aliasing to "mbcs" in such a case."""
    if sys.platform == 'win32':
        import _bootlocale, codecs
        enc = _bootlocale.getpreferredencoding(False)
        if enc.startswith('cp'):            # "cp***" ?
            try:
                codecs.lookup(enc)
            except LookupError:
                import encodings
                encodings._cache[enc] = encodings._unknown
                encodings.aliases.aliases[enc] = 'mbcs'
---

In Python 3.6, encodings module:
---
(...)
codecs.register(search_function)

if sys.platform == 'win32':
    def _alias_mbcs(encoding):
        try:
            import _bootlocale
            if encoding == _bootlocale.getpreferredencoding(False):
                import encodings.mbcs
                return encodings.mbcs.getregentry()
        except ImportError:
            # Imports may fail while we are shutting down
            pass

    codecs.register(_alias_mbcs)
---

Python 3.7, encodings module:
---
(...)
codecs.register(search_function)

if sys.platform == 'win32':
    def _alias_mbcs(encoding):
        try:
            import _winapi
            ansi_code_page = "cp%s" % _winapi.GetACP()
            if encoding == ansi_code_page:
                import encodings.mbcs
                return encodings.mbcs.getregentry()
        except ImportError:
            # Imports may fail while we are shutting down
            pass

    codecs.register(_alias_mbcs)
---

The Python 3.6 and 3.7 "codecs.register(_alias_mbcs)" doesn't work because "search_function()" is tested before and it works for "cpXXX" encodings.

My change alters the order in which codecs search functions are registered: first the MBCS alias, then the encodings search_function().

In Python 3.5, the alias was only created if Python didn't support the code page. |
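
The cross-platform behaviour that Eryk describes for "cp1252" can be checked on any OS. The following is only a minimal sketch: it exercises Python's own "cp1252" codec and does not touch "mbcs", which exists only on Windows. The variable names are illustrative and not taken from CPython.

```python
# Byte 0x81 is undefined in Python's cp1252 table, so strict decoding
# raises instead of mapping the byte to some character.
try:
    b"\x81".decode("cp1252")
except UnicodeDecodeError:
    undefined_byte_rejected = True
else:
    undefined_byte_rejected = False

# "\u03b1" is the Greek letter alpha. With the "replace" error handler,
# Python's cp1252 codec emits "?" rather than a best-fit letter.
replaced = "\u03b1".encode("cp1252", "replace")

print(undefined_byte_rejected, replaced)
```

If "cp1252" were silently aliased to "mbcs" on a Windows system whose ANSI code page is 1252, both results could differ according to the description above, which is why the alias must not shadow the regular codec.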
|||
msg412680 - (view) | Author: STINNER Victor (vstinner) * | Date: 2022-02-06 23:10 | |
The alias was created in 2003 to fix bpo-671666:
---
commit 4eab486476c0082087a8460a5ab1064e64cc1a6b
Author: Martin v. Löwis <martin@v.loewis.de>
Date:   Mon Mar 3 09:34:01 2003 +0000

    Patch #671666: Alias ANSI code page to "mbcs".
---

In 2003, bpo-671666 was created because Python didn't support the "cp932" encoding, whereas the MBCS codec was available and could be used directly since cp932 was the ANSI code page. The alias made it possible to support ANSI code page 932 without implementing it.

But Python got a "cp932" codec the year after:
---
commit 3e2a30692085d32ac63f72b35da39158a471fc68
Author: Hye-Shik Chang <hyeshik@gmail.com>
Date:   Sat Jan 17 14:29:29 2004 +0000

    Add CJK codecs support as discussed on python-dev. (SF #873597)

    Several style fixes are suggested by Martin v. Loewis and
    Marc-Andre Lemburg. Thanks!
--- |
|||
msg412683 - (view) | Author: STINNER Victor (vstinner) * | Date: 2022-02-06 23:13 | |
Python 3.11 supports the following 40 code pages: * 037 * 273 * 424 * 437 * 500 * 720 * 737 * 775 * 850 * 852 * 855 * 856 * 857 * 858 * 860 * 861 * 862 * 863 * 864 * 865 * 866 * 869 * 874 * 875 * 932 * 949 * 950 * 1006 * 1026 * 1125 * 1140 * 1250 * 1251 * 1252 * 1253 * 1254 * 1255 * 1256 * 1257 * 1258 |
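
The list above can be re-checked against a given interpreter with a short loop over codecs.lookup(). This is a sketch, assuming the "cpNNN" spelling for each code page; CODE_PAGES and missing are names introduced here for illustration.

```python
import codecs

# The 40 code page numbers listed in the message above.
CODE_PAGES = """
037 273 424 437 500 720 737 775 850 852
855 856 857 858 860 861 862 863 864 865
866 869 874 875 932 949 950 1006 1026 1125
1140 1250 1251 1252 1253 1254 1255 1256 1257 1258
""".split()

# Collect every code page for which no codec can be found.
missing = []
for cp in CODE_PAGES:
    try:
        codecs.lookup(f"cp{cp}")
    except LookupError:
        missing.append(cp)

print(len(CODE_PAGES), "code pages checked; missing:", missing)
```

An empty "missing" list means every listed code page resolves to a codec on the interpreter running the check.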
|||
msg412691 - (view) | Author: Eryk Sun (eryksun) * | Date: 2022-02-07 00:10 | |
> The Python 3.6 and 3.7 "codecs.register(_alias_mbcs)" doesn't work
> because "search_function()" is tested before and it works for "cpXXX"
> encodings.

Isn't the 3.6-3.10 ordering of search_function() and _alias_mbcs() correct as a fallback? In this case, Python doesn't support a cross-platform encoding for the code page. That's why the old implementation of test_mbcs_alias() mocked _winapi.GetACP() to return 123 and then checked that looking up 'cp123' returned the "mbcs" codec.

I'd actually prefer to extend this by implementing _winapi.GetOEMCP() and using "oem" as a fallback for that case. |
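
The fallback ordering discussed here can be illustrated portably. The sketch below is not the CPython implementation: make_ansi_fallback() is a hypothetical helper, the get_acp callable stands in for _winapi.GetACP(), and "utf-8" stands in for "mbcs" so that the example also runs outside Windows. Codec search functions are tried in registration order, and the standard encodings search function is registered at interpreter startup, so a function registered later only sees names that the standard one could not resolve.

```python
import codecs

def make_ansi_fallback(get_acp, fallback="utf-8"):
    """Return a search function aliasing "cp<ACP>" to `fallback`.

    Hypothetical helper: `get_acp` stands in for _winapi.GetACP() and
    `fallback` stands in for the Windows-only "mbcs" codec.
    """
    def search(encoding):
        if encoding == "cp%s" % get_acp():
            return codecs.lookup(fallback)
        return None  # let other search functions handle the name
    return search

# Pretend the ANSI code page is 123, as the old test_mbcs_alias() did.
fallback_search = make_ansi_fallback(lambda: 123)
codecs.register(fallback_search)  # registered after the standard search

regular = codecs.lookup("cp1252").name  # resolved by the standard search
aliased = codecs.lookup("cp123").name   # only the fallback knows this name

print(regular, aliased)

# codecs.unregister() exists since Python 3.10; undo the registration.
if hasattr(codecs, "unregister"):
    codecs.unregister(fallback_search)
```

With this ordering, a code page that already has a cross-platform codec keeps it, and the alias only applies to code pages Python does not implement.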
|||
msg412738 - (view) | Author: STINNER Victor (vstinner) * | Date: 2022-02-07 13:00 | |
I don't think that this fallback is needed anymore. Which Windows code page can be used as ANSI code page which is not already implemented as a Python codec? |
|||
msg412777 - (view) | Author: Eryk Sun (eryksun) * | Date: 2022-02-07 17:53 | |
> I don't think that this fallback is needed anymore. Which Windows
> code page can be used as ANSI code page which is not already
> implemented as a Python codec?

Python has full coverage of the ANSI and OEM code pages in the standard Windows locales, but I don't have any experience with custom (i.e. supplemental or replacement) locales.

https://docs.microsoft.com/en-us/windows/win32/intl/custom-locales

Here's a simple script to check the standard locales.

    import codecs
    import ctypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

    LOCALE_ALL = 0
    LOCALE_WINDOWS = 1
    LOCALE_IDEFAULTANSICODEPAGE = 0x1004
    LOCALE_IDEFAULTCODEPAGE = 0x000B # OEM

    EnumSystemLocalesEx = kernel32.EnumSystemLocalesEx
    GetLocaleInfoEx = kernel32.GetLocaleInfoEx
    GetCPInfoExW = kernel32.GetCPInfoExW

    EnumLocalesProcEx = ctypes.WINFUNCTYPE(ctypes.c_int,
        ctypes.c_wchar_p, ctypes.c_ulong, ctypes.c_void_p)

    class CPINFOEXW(ctypes.Structure):
        _fields_ = (('MaxCharSize', ctypes.c_uint),
                    ('DefaultChar', ctypes.c_ubyte * 2),
                    ('LeadByte', ctypes.c_ubyte * 12),
                    ('UnicodeDefaultChar', ctypes.c_wchar),
                    ('CodePage', ctypes.c_uint),
                    ('CodePageName', ctypes.c_wchar * 260))

    def get_all_locale_code_pages():
        result = []
        seen = set()
        info = (ctypes.c_wchar * 100)()

        @EnumLocalesProcEx
        def callback(locale, flags, param):
            for lctype in (LOCALE_IDEFAULTANSICODEPAGE, LOCALE_IDEFAULTCODEPAGE):
                if (GetLocaleInfoEx(locale, lctype, info, len(info)) and
                      info.value not in ('0', '1')):
                    cp = int(info.value)
                    if cp in seen:
                        continue
                    seen.add(cp)
                    cp_info = CPINFOEXW()
                    if not GetCPInfoExW(cp, 0, ctypes.byref(cp_info)):
                        cp_info.CodePage = cp
                        cp_info.CodePageName = str(cp)
                    result.append(cp_info)
            return True

        if not EnumSystemLocalesEx(callback, LOCALE_WINDOWS, None, None):
            raise ctypes.WinError(ctypes.get_last_error())

        result.sort(key=lambda x: x.CodePage)
        return result

    supported = []
    unsupported = []
    for cp_info in get_all_locale_code_pages():
        cp = cp_info.CodePage
        try:
            codecs.lookup(f'cp{cp}')
        except LookupError:
            unsupported.append(cp_info)
        else:
            supported.append(cp_info)

    if unsupported:
        print('Unsupported:\n')
        for cp_info in unsupported:
            print(cp_info.CodePageName)
        print('\nSupported:\n')
    else:
        print('All Supported:\n')

    for cp_info in supported:
        print(cp_info.CodePageName)

Output:

    All Supported:

    437 (OEM - United States)
    720 (Arabic - Transparent ASMO)
    737 (OEM - Greek 437G)
    775 (OEM - Baltic)
    850 (OEM - Multilingual Latin I)
    852 (OEM - Latin II)
    855 (OEM - Cyrillic)
    857 (OEM - Turkish)
    862 (OEM - Hebrew)
    866 (OEM - Russian)
    874 (ANSI/OEM - Thai)
    932 (ANSI/OEM - Japanese Shift-JIS)
    936 (ANSI/OEM - Simplified Chinese GBK)
    949 (ANSI/OEM - Korean)
    950 (ANSI/OEM - Traditional Chinese Big5)
    1250 (ANSI - Central Europe)
    1251 (ANSI - Cyrillic)
    1252 (ANSI - Latin I)
    1253 (ANSI - Greek)
    1254 (ANSI - Turkish)
    1255 (ANSI - Hebrew)
    1256 (ANSI - Arabic)
    1257 (ANSI - Baltic)
    1258 (ANSI/OEM - Viet Nam)

Some locales are Unicode only (e.g. Hindi-India) or have no OEM code page, which the above code skips by checking for "0" or "1" as the code page value. Windows 10+ allows setting the system locale to a Unicode-only locale, for which it uses UTF-8 (65001) for ANSI and OEM.

The OEM code page matters because the console input and output code pages default to OEM, e.g. for os.device_encoding(). The console's I/O code pages are used in Python by low-level os.read() and os.write(). Note that the console doesn't properly implement using UTF-8 (65001) as the input code page. In this case, input read from the console via ReadFile() or ReadConsoleA() has a null byte in place of each non-ASCII character. |
|||
msg412847 - (view) | Author: STINNER Victor (vstinner) * | Date: 2022-02-08 17:09 | |
I created GH-31218 which basically restores Python 3.10 code but enhances the test. |
|||
msg413825 - (view) | Author: STINNER Victor (vstinner) * | Date: 2022-02-23 17:14 | |
commit ccbe8045faf6e63d36229ea4e1b9298572cda126
Author: Victor Stinner <vstinner@python.org>
Date:   Tue Feb 22 22:04:07 2022 +0100

    bpo-46659: Fix the MBCS codec alias on Windows (GH-31218) |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:59:55 | admin | set | github: 90826 |
2022-02-23 17:14:28 | vstinner | set | status: open -> closed resolution: fixed messages: + msg413825 stage: patch review -> resolved |
2022-02-08 17:09:32 | vstinner | set | messages: + msg412847 |
2022-02-07 17:53:46 | eryksun | set | messages: + msg412777 |
2022-02-07 13:00:07 | vstinner | set | messages: + msg412738 |
2022-02-07 00:10:02 | eryksun | set | nosy: + eryksun messages: + msg412691 |
2022-02-06 23:17:16 | vstinner | set | keywords: + patch stage: patch review pull_requests: + pull_request29345 |
2022-02-06 23:13:19 | vstinner | set | messages: + msg412683 |
2022-02-06 23:10:38 | vstinner | set | messages: + msg412680 |
2022-02-06 23:06:49 | vstinner | create |