Message 350485 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	eryksun
Recipients	eryksun, paul.moore, steve.dower, tim.golden, xtreak, zach.ware
Date	2019-08-26.05:11:16
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1566796277.17.0.314381965288.issue37945@roundup.psfhosted.org>
In-reply-to

Content
local.normalize is generally wrong in Windows. It's meant for POSIX systems. Currently "tr_TR" is parsed as follows: >>> locale._parse_localename('tr_TR') ('tr_TR', 'ISO8859-9') The encoding "ISO8859-9" is meaningless to Windows. Also, the old CRT only ever supported either full language/country names or non-standard abbreviations -- e.g. either "Turkish_Turkey" or "trk_TUR". Having locale.getdefaultlocale() return ISO two-letter codes (e.g. "en_GB") was fundamentally wrong for the old CRT. (2.7 will die with this wart.) 3.5+ uses the Universal CRT, which does support standard ISO codes, but only in BCP 47 [1] locale names of the following form: language ISO 639 ["-" script] ISO 15924 ["-" region] ISO 3166-1 BCP 47 locale names have been preferred by Windows for the past 13 years, since Vista was released. Windows extends BCP 47 with a non-standard sort-order field (e.g. "de-Latn-DE_phoneb" is the German language with Latin script in the region of Germany with phone-book sort order). Another departure from strict BCP 47 in Windows is allowing underscore to be used as the delimiter instead of hyphen. In a concession to existing C code, the Universal CRT also supports an encoding suffix in BCP 47 locales, but this can only be either ".utf-8" or ".utf8". (Windows itself does not support specifying an encoding in a locale name, but it's Unicode anyway.) No other encoding is allowed. If ".utf-8" isn't specified, a BCP 47 locale defaults to the locale's ANSI codepage. However, there's no way to convey this in the locale name itself. Also, if a locale is Unicode only (e.g. Hindi), the CRT implicitly uses UTF-8 even without the ".utf-8" suffix. The following are valid BCP 47 locale names in the CRT: "tr", "tr.utf-8", "tr-TR", "tr_TR", "tr_TR.utf8", or "tr-Latn-TR.utf-8". But note that "tr_TR.1254" is not supported. The following shows that omitting the optional "utf-8" encoding in a BCP 47 locale makes the CRT default to the associated ANSI codepage. >>> locale.setlocale(locale.LC_CTYPE, 'tr_TR') 'tr_TR' >>> ucrt.___lc_codepage_func() 1254 C ___lc_codepage_func() queries the codepage of the current locale. We can directly query this codepage for a BCP 47 locale via GetLocaleInfoEx: >>> cpstr = (ctypes.c_wchar * 6)() >>> kernel32.GetLocaleInfoEx('tr-TR', ... LOCALE_IDEFAULTANSICODEPAGE, cpstr, len(cpstr)) 5 >>> cpstr.value '1254' If the result is '0', it's a Unicode-only locale (e.g. 'hi-IN' -- Hindi, India). Recent versions of the CRT use UTF-8 (codepage 65001) for Unicode-only locales: >>> locale.setlocale(locale.LC_CTYPE, 'hi-IN') 'hi-IN' >>> ucrt.___lc_codepage_func() 65001 Here are some example locale tuples that should be supported, given that the CRT continues to support full English locale names and non-standard abbreviations, in addition to the new BCP 47 names: ('tr', None) ('tr_TR', None) ('tr_Latn_TR, None) ('tr_TR', 'utf-8') ('trk_TUR', '1254') ('Turkish_Turkey', '1254') The return value from C setlocale can be normalized to replace hyphen delimiters with underscores, and "utf8" can be normalized as "utf-8". If it's a BCP 47 locale that has no encoding, GetLocaleInfoEx can be called to query the ANSI codepage. UTF-8 can be assumed if it's a Unicode-only locale. As to prefixing a codepage with 'cp', we don't really need to do this. We have aliases defined for most, such as '1252' -> 'cp1252'. But if the 'cp' prefix does get added, then the locale module should at least know to remove it when building a locale name from a tuple. [1] https://tools.ietf.org/rfc/bcp/bcp47.txt

local.normalize is generally wrong in Windows. It's meant for POSIX systems. Currently "tr_TR" is parsed as follows:

    >>> locale._parse_localename('tr_TR')
    ('tr_TR', 'ISO8859-9')

The encoding "ISO8859-9" is meaningless to Windows. Also, the old CRT only ever supported either full language/country names or non-standard abbreviations -- e.g. either "Turkish_Turkey" or "trk_TUR". Having locale.getdefaultlocale() return ISO two-letter codes (e.g. "en_GB") was fundamentally wrong for the old CRT. (2.7 will die with this wart.)

3.5+ uses the Universal CRT, which does support standard ISO codes, but only in BCP 47 [1] locale names of the following form:

    language           ISO 639
    ["-" script]       ISO 15924
    ["-" region]       ISO 3166-1

BCP 47 locale names have been preferred by Windows for the past 13 years, since Vista was released. Windows extends BCP 47 with a non-standard sort-order field (e.g. "de-Latn-DE_phoneb" is the German language with Latin script in the region of Germany with phone-book sort order). Another departure from strict BCP 47 in Windows is allowing underscore to be used as the delimiter instead of hyphen. 

In a concession to existing C code, the Universal CRT also supports an encoding suffix in BCP 47 locales, but this can only be either ".utf-8" or ".utf8". (Windows itself does not support specifying an encoding in a locale name, but it's Unicode anyway.) No other encoding is allowed. If ".utf-8" isn't specified, a BCP 47 locale defaults to the locale's ANSI codepage. However, there's no way to convey this in the locale name itself. Also, if a locale is Unicode only (e.g. Hindi), the CRT implicitly uses UTF-8 even without the ".utf-8" suffix.

The following are valid BCP 47 locale names in the CRT: "tr", "tr.utf-8", "tr-TR", "tr_TR", "tr_TR.utf8", or "tr-Latn-TR.utf-8". But note that "tr_TR.1254" is not supported.

The following shows that omitting the optional "utf-8" encoding in a BCP 47 locale makes the CRT default to the associated ANSI codepage. 

    >>> locale.setlocale(locale.LC_CTYPE, 'tr_TR')
    'tr_TR'
    >>> ucrt.___lc_codepage_func()
    1254

C ___lc_codepage_func() queries the codepage of the current locale. We can directly query this codepage for a BCP 47 locale via GetLocaleInfoEx:

    >>> cpstr = (ctypes.c_wchar * 6)()
    >>> kernel32.GetLocaleInfoEx('tr-TR',
    ...     LOCALE_IDEFAULTANSICODEPAGE, cpstr, len(cpstr))
    5
    >>> cpstr.value
    '1254'

If the result is '0', it's a Unicode-only locale (e.g. 'hi-IN' -- Hindi, India). Recent versions of the CRT use UTF-8 (codepage 65001) for Unicode-only locales:

    >>> locale.setlocale(locale.LC_CTYPE, 'hi-IN')
    'hi-IN'
    >>> ucrt.___lc_codepage_func()
    65001

Here are some example locale tuples that should be supported, given that the CRT continues to support full English locale names and non-standard abbreviations, in addition to the new BCP 47 names:

    ('tr', None)
    ('tr_TR', None)
    ('tr_Latn_TR, None)
    ('tr_TR', 'utf-8')
    
    ('trk_TUR', '1254')
    ('Turkish_Turkey', '1254')

The return value from C setlocale can be normalized to replace hyphen delimiters with underscores, and "utf8" can be normalized as "utf-8". If it's a BCP 47 locale that has no encoding, GetLocaleInfoEx can be called to query the ANSI codepage. UTF-8 can be assumed if it's a Unicode-only locale. 

As to prefixing a codepage with 'cp', we don't really need to do this. We have aliases defined for most, such as '1252' -> 'cp1252'. But if the 'cp' prefix does get added, then the locale module should at least know to remove it when building a locale name from a tuple.

[1] https://tools.ietf.org/rfc/bcp/bcp47.txt

History
Date	User	Action	Args
2019-08-26 05:11:17	eryksun	set	recipients: + eryksun, paul.moore, tim.golden, zach.ware, steve.dower, xtreak
2019-08-26 05:11:17	eryksun	set	messageid: <1566796277.17.0.314381965288.issue37945@roundup.psfhosted.org>
2019-08-26 05:11:17	eryksun	link	issue37945 messages
2019-08-26 05:11:16	eryksun	create