Message350485
locale.normalize is generally wrong on Windows. It's meant for POSIX systems. Currently "tr_TR" is parsed as follows:
>>> locale._parse_localename('tr_TR')
('tr_TR', 'ISO8859-9')
The encoding "ISO8859-9" is meaningless to Windows. Also, the old CRT only ever supported either full language/country names or non-standard abbreviations -- e.g. either "Turkish_Turkey" or "trk_TUR". Having locale.getdefaultlocale() return ISO two-letter codes (e.g. "en_GB") was fundamentally wrong for the old CRT. (2.7 will die with this wart.)
3.5+ uses the Universal CRT, which does support standard ISO codes, but only in BCP 47 [1] locale names of the following form:
    language        ISO 639
    ["-" script]    ISO 15924
    ["-" region]    ISO 3166-1
BCP 47 locale names have been preferred by Windows for the past 13 years, since Vista was released. Windows extends BCP 47 with a non-standard sort-order field (e.g. "de-Latn-DE_phoneb" is the German language with Latin script in the region of Germany with phone-book sort order). Another departure from strict BCP 47 in Windows is allowing underscore to be used as the delimiter instead of hyphen.
In a concession to existing C code, the Universal CRT also supports an encoding suffix in BCP 47 locales, but this can only be either ".utf-8" or ".utf8". (Windows itself does not support specifying an encoding in a locale name, but it's Unicode anyway.) No other encoding is allowed. If ".utf-8" isn't specified, a BCP 47 locale defaults to the locale's ANSI codepage. However, there's no way to convey this in the locale name itself. Also, if a locale is Unicode only (e.g. Hindi), the CRT implicitly uses UTF-8 even without the ".utf-8" suffix.
The following are valid BCP 47 locale names in the CRT: "tr", "tr.utf-8", "tr-TR", "tr_TR", "tr_TR.utf8", or "tr-Latn-TR.utf-8". But note that "tr_TR.1254" is not supported.
The following shows that omitting the optional ".utf-8" encoding in a BCP 47 locale makes the CRT default to the associated ANSI codepage.
>>> import ctypes, locale
>>> ucrt = ctypes.CDLL('ucrtbase', use_errno=True)
>>> locale.setlocale(locale.LC_CTYPE, 'tr_TR')
'tr_TR'
>>> ucrt.___lc_codepage_func()
1254
The C function ___lc_codepage_func() queries the codepage of the current locale. We can directly query this codepage for a BCP 47 locale via GetLocaleInfoEx:
>>> import ctypes
>>> kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
>>> LOCALE_IDEFAULTANSICODEPAGE = 0x1004
>>> cpstr = (ctypes.c_wchar * 6)()
>>> kernel32.GetLocaleInfoEx('tr-TR',
... LOCALE_IDEFAULTANSICODEPAGE, cpstr, len(cpstr))
5
>>> cpstr.value
'1254'
If the result is '0', it's a Unicode-only locale (e.g. 'hi-IN' -- Hindi, India). Recent versions of the CRT use UTF-8 (codepage 65001) for Unicode-only locales:
>>> locale.setlocale(locale.LC_CTYPE, 'hi-IN')
'hi-IN'
>>> ucrt.___lc_codepage_func()
65001
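The two queries above could be wrapped roughly as follows. This is only a sketch: `query_ansi_codepage` and `codepage_to_encoding` are hypothetical helper names, not anything in the locale module, and the value 0x1004 for LOCALE_IDEFAULTANSICODEPAGE is taken from the Windows SDK headers. The GetLocaleInfoEx call is guarded because it only exists on Windows.

```python
import sys

LOCALE_IDEFAULTANSICODEPAGE = 0x1004  # from the Windows SDK (winnls.h)


def codepage_to_encoding(cpstr):
    """Map a codepage string from GetLocaleInfoEx to an encoding label."""
    # '0' marks a Unicode-only locale, for which recent CRTs use UTF-8.
    return "utf-8" if cpstr == "0" else cpstr


def query_ansi_codepage(name):
    """Return the default ANSI codepage string for a BCP 47 locale name."""
    if sys.platform != "win32":
        raise OSError("GetLocaleInfoEx is only available on Windows")
    import ctypes
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    cpstr = (ctypes.c_wchar * 6)()
    # Returns the number of characters written (including the null), or 0.
    written = kernel32.GetLocaleInfoEx(
        name, LOCALE_IDEFAULTANSICODEPAGE, cpstr, len(cpstr))
    if not written:
        raise ctypes.WinError(ctypes.get_last_error())
    return cpstr.value


if __name__ == "__main__" and sys.platform == "win32":
    for name in ("tr-TR", "hi-IN"):
        print(name, codepage_to_encoding(query_ansi_codepage(name)))
```

Keeping the '0' check in its own function means the Unicode-only rule can be exercised without a Windows machine.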
Here are some example locale tuples that should be supported, given that the CRT continues to support full English locale names and non-standard abbreviations, in addition to the new BCP 47 names:
('tr', None)
('tr_TR', None)
('tr_Latn_TR', None)
('tr_TR', 'utf-8')
('trk_TUR', '1254')
('Turkish_Turkey', '1254')
The return value from C setlocale can be normalized to replace hyphen delimiters with underscores, and "utf8" can be normalized as "utf-8". If it's a BCP 47 locale that has no encoding, GetLocaleInfoEx can be called to query the ANSI codepage. UTF-8 can be assumed if it's a Unicode-only locale.
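A minimal sketch of that normalization, assuming a single-category result from C setlocale (not the composite "LC_COLLATE=...;LC_CTYPE=..." string that LC_ALL can return). The function name `split_crt_locale` is hypothetical. Note that it replaces hyphens unconditionally, so a real implementation would first have to make sure the name is a BCP 47 name and not a full English name that happens to contain a hyphen. Filling in a missing encoding via GetLocaleInfoEx is left out.

```python
def split_crt_locale(name):
    """Split a C setlocale() result into a (name, encoding) tuple."""
    base, sep, encoding = name.partition(".")
    # Normalize BCP 47 hyphen delimiters to underscores.
    base = base.replace("-", "_")
    if not sep:
        # No encoding suffix; the caller may query the ANSI codepage.
        return base, None
    # The CRT accepts both spellings; report a single one.
    if encoding.lower() in ("utf8", "utf-8"):
        encoding = "utf-8"
    return base, encoding


if __name__ == "__main__":
    for result in ("tr-TR", "tr-Latn-TR.utf8", "Turkish_Turkey.1254"):
        print(result, "->", split_crt_locale(result))
```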
As to prefixing a codepage with 'cp', we don't really need to do this. We have aliases defined for most, such as '1252' -> 'cp1252'. But if the 'cp' prefix does get added, then the locale module should at least know to remove it when building a locale name from a tuple.
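For the reverse direction, something like the following would be enough to drop a 'cp' prefix when building a locale name from a tuple. Again this is only a sketch under my own assumptions, and `build_crt_locale` is a hypothetical name, not an existing locale function.

```python
def build_crt_locale(localetuple):
    """Build a CRT locale name from a (name, encoding) tuple."""
    name, encoding = localetuple
    if encoding is None:
        return name
    lowered = encoding.lower()
    # The CRT wants a bare codepage number, e.g. '1254', not 'cp1254'.
    if lowered.startswith("cp") and lowered[2:].isdigit():
        encoding = lowered[2:]
    return f"{name}.{encoding}"


if __name__ == "__main__":
    for item in (("tr_TR", None), ("tr_TR", "utf-8"),
                 ("Turkish_Turkey", "cp1254")):
        print(item, "->", build_crt_locale(item))
```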
[1] https://tools.ietf.org/rfc/bcp/bcp47.txt
Date                 User     Action  Args
2019-08-26 05:11:17  eryksun  set     recipients: + eryksun, paul.moore, tim.golden, zach.ware, steve.dower, xtreak
2019-08-26 05:11:17  eryksun  set     messageid: <1566796277.17.0.314381965288.issue37945@roundup.psfhosted.org>
2019-08-26 05:11:17  eryksun  link    issue37945 messages
2019-08-26 05:11:16  eryksun  create