Message412777
> I don't think that this fallback is needed anymore. Which Windows
> code page can be used as ANSI code page which is not already
> implemented as a Python codec?
Python has full coverage of the ANSI and OEM code pages in the standard Windows locales, but I don't have any experience with custom (i.e. supplemental or replacement) locales.
https://docs.microsoft.com/en-us/windows/win32/intl/custom-locales
Here's a simple script to check the standard locales.
    import codecs
    import ctypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

    LOCALE_ALL = 0
    LOCALE_WINDOWS = 1
    LOCALE_IDEFAULTANSICODEPAGE = 0x1004
    LOCALE_IDEFAULTCODEPAGE = 0x000B # OEM

    EnumSystemLocalesEx = kernel32.EnumSystemLocalesEx
    GetLocaleInfoEx = kernel32.GetLocaleInfoEx
    GetCPInfoExW = kernel32.GetCPInfoExW

    EnumLocalesProcEx = ctypes.WINFUNCTYPE(ctypes.c_int,
        ctypes.c_wchar_p, ctypes.c_ulong, ctypes.c_void_p)

    class CPINFOEXW(ctypes.Structure):
        _fields_ = (('MaxCharSize', ctypes.c_uint),
                    ('DefaultChar', ctypes.c_ubyte * 2),
                    ('LeadByte', ctypes.c_ubyte * 12),
                    ('UnicodeDefaultChar', ctypes.c_wchar),
                    ('CodePage', ctypes.c_uint),
                    ('CodePageName', ctypes.c_wchar * 260))

    def get_all_locale_code_pages():
        result = []
        seen = set()
        info = (ctypes.c_wchar * 100)()

        @EnumLocalesProcEx
        def callback(locale, flags, param):
            for lctype in (LOCALE_IDEFAULTANSICODEPAGE, LOCALE_IDEFAULTCODEPAGE):
                # Skip "0" and "1": Unicode-only locale or no OEM code page.
                if (GetLocaleInfoEx(locale, lctype, info, len(info)) and
                      info.value not in ('0', '1')):
                    cp = int(info.value)
                    if cp in seen:
                        continue
                    seen.add(cp)
                    cp_info = CPINFOEXW()
                    # If the lookup fails, fall back to the bare number.
                    if not GetCPInfoExW(cp, 0, ctypes.byref(cp_info)):
                        cp_info.CodePage = cp
                        cp_info.CodePageName = str(cp)
                    result.append(cp_info)
            # Returning true continues the enumeration.
            return True

        if not EnumSystemLocalesEx(callback, LOCALE_WINDOWS, None, None):
            raise ctypes.WinError(ctypes.get_last_error())

        result.sort(key=lambda x: x.CodePage)
        return result

    supported = []
    unsupported = []
    for cp_info in get_all_locale_code_pages():
        cp = cp_info.CodePage
        try:
            codecs.lookup(f'cp{cp}')
        except LookupError:
            unsupported.append(cp_info)
        else:
            supported.append(cp_info)

    if unsupported:
        print('Unsupported:\n')
        for cp_info in unsupported:
            print(cp_info.CodePageName)
        print('\nSupported:\n')
    else:
        print('All Supported:\n')
    for cp_info in supported:
        print(cp_info.CodePageName)
Output:
All Supported:
437 (OEM - United States)
720 (Arabic - Transparent ASMO)
737 (OEM - Greek 437G)
775 (OEM - Baltic)
850 (OEM - Multilingual Latin I)
852 (OEM - Latin II)
855 (OEM - Cyrillic)
857 (OEM - Turkish)
862 (OEM - Hebrew)
866 (OEM - Russian)
874 (ANSI/OEM - Thai)
932 (ANSI/OEM - Japanese Shift-JIS)
936 (ANSI/OEM - Simplified Chinese GBK)
949 (ANSI/OEM - Korean)
950 (ANSI/OEM - Traditional Chinese Big5)
1250 (ANSI - Central Europe)
1251 (ANSI - Cyrillic)
1252 (ANSI - Latin I)
1253 (ANSI - Greek)
1254 (ANSI - Turkish)
1255 (ANSI - Hebrew)
1256 (ANSI - Arabic)
1257 (ANSI - Baltic)
1258 (ANSI/OEM - Viet Nam)
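The codec half of this check doesn't need Windows. As a minimal sketch, assuming the code page numbers are simply copied by hand from the output above (the names `LISTED_CODE_PAGES` and `missing_codecs` are made up for this example), the following confirms on any platform that each listed code page resolves to a Python codec:

```python
import codecs

# Code page numbers copied by hand from the output listed above.
LISTED_CODE_PAGES = [
    437, 720, 737, 775, 850, 852, 855, 857, 862, 866, 874,
    932, 936, 949, 950,
    1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258,
]

def missing_codecs(code_pages):
    """Return the code pages for which codecs.lookup('cpNNN') fails."""
    missing = []
    for cp in code_pages:
        try:
            codecs.lookup(f'cp{cp}')
        except LookupError:
            missing.append(cp)
    return missing

if __name__ == '__main__':
    missing = missing_codecs(LISTED_CODE_PAGES)
    print(missing or 'all listed code pages have a codec')
```

This only exercises codecs.lookup(); it says nothing about which code pages a given Windows installation actually reports.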
Some locales are Unicode only (e.g. Hindi-India) or have no OEM code page, which the above code skips by checking for "0" or "1" as the code page value. Windows 10+ allows setting the system locale to a Unicode-only locale, for which it uses UTF-8 (65001) for ANSI and OEM.
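The filtering step can be shown in isolation, without calling into Windows. In this sketch `usable_code_page` is a hypothetical helper, not part of the script above: it takes the string that GetLocaleInfoEx() would write for one of the two LCTYPE values and returns an integer code page, or None for the "0" and "1" placeholders. The second part assumes Python 3.8 or later, where cp65001 is an alias of the UTF-8 codec.

```python
import codecs

def usable_code_page(info_value):
    """Return the code page as an int, or None for the '0'/'1' placeholders."""
    if info_value in ('0', '1'):
        return None
    return int(info_value)

def decode_with_code_page(data, cp):
    """Decode bytes using the Python codec named after a Windows code page."""
    return codecs.decode(data, f'cp{cp}')

if __name__ == '__main__':
    for value in ('0', '1', '1252', '65001'):
        print(repr(value), '->', usable_code_page(value))
    # Assumption: Python 3.8+, so 'cp65001' resolves to UTF-8.
    print(decode_with_code_page('é'.encode('utf-8'), 65001))
```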
The OEM code page matters because the console input and output code pages default to OEM, e.g. for os.device_encoding(). The console's I/O code pages are used in Python by low-level os.read() and os.write(). Note that the console doesn't properly implement using UTF-8 (65001) as the input code page. In this case, input read from the console via ReadFile() or ReadConsoleA() has a null byte in place of each non-ASCII character.
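To see what Python reports for the standard file descriptors, os.device_encoding() can be called on each one. A small sketch follows (the function name `console_encodings` is made up here). os.device_encoding() returns None for any descriptor that isn't connected to a terminal, for example when output is redirected to a file, so the result depends on how the script is run.

```python
import os

def console_encodings():
    """Map each standard stream name to os.device_encoding() of its descriptor."""
    names = {0: 'stdin', 1: 'stdout', 2: 'stderr'}
    return {name: os.device_encoding(fd) for fd, name in names.items()}

if __name__ == '__main__':
    for name, encoding in console_encodings().items():
        # None means the descriptor is not attached to a console/terminal.
        print(f'{name}: {encoding}')
```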
Date | User | Action | Args
2022-02-07 17:53:46 | eryksun | set | recipients: + eryksun, vstinner
2022-02-07 17:53:46 | eryksun | set | messageid: <1644256426.75.0.396130146485.issue46668@roundup.psfhosted.org>
2022-02-07 17:53:46 | eryksun | link | issue46668 messages
2022-02-07 17:53:46 | eryksun | create |