Message 350568 - Python tracker

Author	eryksun
Recipients	eryksun, paul.moore, steve.dower, tim.golden, xtreak, zach.ware
Date	2019-08-26.20:51:32
Message-id	<1566852692.73.0.80295634648.issue37945@roundup.psfhosted.org>

Content

We get into trouble with test_getsetlocale_issue1813 because normalize() maps "tr_TR" (supported) to "tr_TR.ISO8859-9" (not supported).

    >>> locale.normalize('tr_TR')
    'tr_TR.ISO8859-9'

We should skip normalize() on Windows. It's based on a POSIX locale_alias mapping that can only cause problems. The work for normalizing locale names on Windows is best handled inline in _build_localename and _parse_localename.
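As a rough illustration of skipping normalize() on Windows, here is a minimal sketch. The helper name `normalize_unless_windows` and its `platform` parameter are hypothetical (not part of the locale module); the parameter is there only so the Windows branch can be exercised on any OS.

```python
import locale
import sys


def normalize_unless_windows(name, platform=sys.platform):
    """Return `name` unchanged on Windows, else apply locale.normalize().

    Hypothetical helper for illustration only.
    """
    if platform == "win32":
        # The POSIX locale_alias table would append an encoding such as
        # "ISO8859-9", which the Windows CRT does not accept.
        return name
    return locale.normalize(name)


print(normalize_unless_windows("tr_TR", "win32"))
print(normalize_unless_windows("tr_TR", "linux"))
```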

For the old long form, C setlocale always returns the codepage encoding (e.g. "Turkish_Turkey.1254") or "utf8", so that's simple to parse. For BCP 47 locales, the encoding is either "utf8" or "utf-8", or nothing at all. For the latter, there's an implied legacy ANSI encoding. This is used by the CRT wherever we depend on byte strings, such as in time.strftime:

mojibake:

    >>> locale.setlocale(locale.LC_CTYPE, 'en_GB')
    'en_GB'
    >>> time.strftime("\u0100")
    'A'

correct:

    >>> locale.setlocale(locale.LC_CTYPE, 'en_GB.utf-8')
    'en_GB.utf-8'
    >>> time.strftime("\u0100")
    'Ā'

(We should switch back to using wcsftime if possible.)
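The parsing rules above (the old long form always carries a codepage or "utf8"; a BCP 47 name carries "utf8", "utf-8", or nothing) could be sketched roughly as follows. This is not the actual _parse_localename. The function name is hypothetical, and reporting a codepage as "cp1254", UTF-8 as "UTF-8", and a missing encoding as None are assumptions made for the example.

```python
def parse_windows_localename(localename):
    """Split a Windows locale string into (language, encoding).

    Hypothetical sketch; the returned spellings are assumptions.
    """
    # Split on the last dot, in case a long-form name itself contains one.
    # This relies on setlocale always appending an encoding to long-form
    # names, so only BCP 47 names arrive here without a dot.
    head, sep, tail = localename.rpartition(".")
    if not sep:
        # BCP 47 name with an implied legacy ANSI encoding.
        return localename, None
    if tail.lower() in ("utf8", "utf-8"):
        return head, "UTF-8"
    if tail.isdigit():
        # Codepage number, e.g. "1254" is reported as "cp1254".
        return head, "cp" + tail
    # Anything else is passed through unchanged.
    return head, tail


for name in ("Turkish_Turkey.1254", "en_GB.utf-8", "tr_TR"):
    print(name, "->", parse_windows_localename(name))
```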

The implicit BCP 47 case can be parsed as `None` -- e.g. ("tr_TR", None). However, it might be useful to support getting the ANSI codepage via GetLocaleInfoEx [1]. A high-level function in locale could internally call _locale.getlocaleinfo(locale_name, LOCALE_IDEFAULTANSICODEPAGE). This would return a string such as "1254", or "0" for a Unicode-only language.
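Since _locale.getlocaleinfo is only a proposal here, the same query can be approximated with ctypes for experimentation. This is a sketch under assumptions: the function name `ansi_codepage` is hypothetical, it returns None off Windows or on failure, and it converts "_" to "-" on the assumption that GetLocaleInfoEx expects the hyphenated form of the name.

```python
import ctypes
import sys

LOCALE_IDEFAULTANSICODEPAGE = 0x1004  # LCType value from winnls.h


def ansi_codepage(locale_name):
    """Return the default ANSI codepage of a locale as a string.

    Hypothetical helper. Returns None when not on Windows or when the
    API call fails.
    """
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    get_info = kernel32.GetLocaleInfoEx
    get_info.argtypes = (ctypes.c_wchar_p, ctypes.c_uint32,
                         ctypes.c_wchar_p, ctypes.c_int)
    get_info.restype = ctypes.c_int
    buf = ctypes.create_unicode_buffer(16)
    # Assumption: pass the name with "-" instead of "_".
    name = locale_name.replace("_", "-")
    count = get_info(name, LOCALE_IDEFAULTANSICODEPAGE, buf, len(buf))
    if count == 0:
        return None
    return buf.value


print(ansi_codepage("tr_TR"))
```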

For _build_localename, we can't simply limit the encoding to UTF-8. We need to support the old long/abbreviated forms (e.g. "trk_TUR", "turkish_Turkey") in addition to the newer BCP 47 locale names. In the old form we have to support the following encodings:

    * codepage encodings, with an optional "cp" prefix that has 
      to be stripped, e.g. ("trk_TUR", "cp1254") -> "trk_TUR.1254"
    * "ACP" in upper case only -- for the ANSI codepage of the 
      language
    * "utf8" (mixed case) and "utf-8" (mixed case)

(The CRT documentation says "OEM" should also be supported, but it's not.)

A locale name can also omit the language in the old form -- e.g. (None, "ACP") or (None, "cp1254"). The CRT uses the current language in this case. This is discouraged because the result may be nonsense.
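To make the encoding rules above concrete, here is a minimal sketch of how the joining step might look. It is not the actual _build_localename. The function name is hypothetical, and rejecting unknown encodings with ValueError, rejecting (None, None), and accepting the "cp" prefix in either case are choices made for the example.

```python
def build_windows_localename(language, encoding):
    """Join (language, encoding) into a string for the CRT's setlocale().

    Hypothetical sketch of the rules listed above.
    """
    if language is None and encoding is None:
        raise ValueError("need a language, an encoding, or both")
    if encoding is None:
        # BCP 47 name with the implied ANSI encoding, e.g. "tr_TR".
        return language
    lowered = encoding.lower()
    if lowered in ("utf8", "utf-8"):
        suffix = encoding          # accepted in any mix of case
    elif lowered.startswith("cp") and lowered[2:].isdigit():
        suffix = encoding[2:]      # strip the optional "cp" prefix
    elif encoding.isdigit() or encoding == "ACP":
        suffix = encoding          # "ACP" must be upper case
    else:
        raise ValueError("unsupported encoding: %r" % (encoding,))
    # The old form may omit the language, e.g. (None, "ACP") -> ".ACP".
    return (language or "") + "." + suffix


print(build_windows_localename("trk_TUR", "cp1254"))
print(build_windows_localename(None, "ACP"))
```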

[1] https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getlocaleinfoex
