classification
Title: [Windows] locale.getdefaultlocale() issues on Windows: test_locale.test_getsetlocale_issue1813()
Type: behavior Stage:
Components: Library (Lib), Windows Versions: Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: tim.golden Nosy List: eryksun, guy.linton, paul.moore, steve.dower, tim.golden, xtreak, zach.ware
Priority: normal Keywords:

Created on 2019-08-25 17:33 by tim.golden, last changed 2019-09-30 18:34 by vstinner.

Files
File name Uploaded Description Edit
pythoninfo.txt tim.golden, 2019-08-25 17:33 python -mtest.pythoninfo
Messages (18)
msg350466 - (view) Author: Tim Golden (tim.golden) * (Python committer) Date: 2019-08-25 17:33
On a Win10 machine I'm consistently seeing test_locale (and test__locale) fail. I'll attach pythoninfo.

======================================================================
ERROR: test_getsetlocale_issue1813 (test.test_locale.TestMiscellaneous)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\tim\work-in-progress\cpython\lib\test\test_locale.py", line 531, in test_getsetlocale_issue1813
    locale.setlocale(locale.LC_CTYPE, loc)
  File "C:\Users\tim\work-in-progress\cpython\lib\locale.py", line 604, in setlocale
    return _setlocale(category, locale)
locale.Error: unsupported locale setting
msg350470 - (view) Author: Tim Golden (tim.golden) * (Python committer) Date: 2019-08-25 19:29
Ok; so basically this doesn't work:

<code>
import locale
locale.setlocale(locale.LC_CTYPE, locale.getdefaultlocale())
</code>

It gives "locale.Error: unsupported locale setting" which comes from https://github.com/python/cpython/blob/master/Modules/_localemodule.c#L107

(For locale.getdefaultlocale() you could substitute locale.getlocale() or simply ("en_GB", "cp1252")). On my machine it raises that exception on Python 2.7.15, 3.6.6 and on master. 

Interestingly, none of the other tests in test_locale appear to exercise the 2-tuple 2nd param to setlocale. When you call setlocale and it returns the previous setting, it's a single string, eg "en_GB" etc. Passing that back in works. But when you call getlocale, it returns the 2-tuple, eg ("en_GB", "cp1252"). But all the other tests use the setlocale-returns-current trick for their setup/teardown.

I've quickly tested on 3.5 on Linux and the 2-tuple version works ok. I assume it's working on buildbots or we'd see the Turkish test failing every time. So is there something different about my C runtime, I wonder?
msg350471 - (view) Author: Tim Golden (tim.golden) * (Python committer) Date: 2019-08-25 19:34
Just to save you looking, the code in https://github.com/python/cpython/blob/master/Modules/_localemodule.c#L107 converts the 2-tuple to lang.encoding form so the C module is seeing "en_GB.cp1252"
msg350485 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2019-08-26 05:11
local.normalize is generally wrong in Windows. It's meant for POSIX systems. Currently "tr_TR" is parsed as follows:

    >>> locale._parse_localename('tr_TR')
    ('tr_TR', 'ISO8859-9')

The encoding "ISO8859-9" is meaningless to Windows. Also, the old CRT only ever supported either full language/country names or non-standard abbreviations -- e.g. either "Turkish_Turkey" or "trk_TUR". Having locale.getdefaultlocale() return ISO two-letter codes (e.g. "en_GB") was fundamentally wrong for the old CRT. (2.7 will die with this wart.)

3.5+ uses the Universal CRT, which does support standard ISO codes, but only in BCP 47 [1] locale names of the following form:

    language           ISO 639
    ["-" script]       ISO 15924
    ["-" region]       ISO 3166-1

BCP 47 locale names have been preferred by Windows for the past 13 years, since Vista was released. Windows extends BCP 47 with a non-standard sort-order field (e.g. "de-Latn-DE_phoneb" is the German language with Latin script in the region of Germany with phone-book sort order). Another departure from strict BCP 47 in Windows is allowing underscore to be used as the delimiter instead of hyphen. 

In a concession to existing C code, the Universal CRT also supports an encoding suffix in BCP 47 locales, but this can only be either ".utf-8" or ".utf8". (Windows itself does not support specifying an encoding in a locale name, but it's Unicode anyway.) No other encoding is allowed. If ".utf-8" isn't specified, a BCP 47 locale defaults to the locale's ANSI codepage. However, there's no way to convey this in the locale name itself. Also, if a locale is Unicode only (e.g. Hindi), the CRT implicitly uses UTF-8 even without the ".utf-8" suffix.

The following are valid BCP 47 locale names in the CRT: "tr", "tr.utf-8", "tr-TR", "tr_TR", "tr_TR.utf8", or "tr-Latn-TR.utf-8". But note that "tr_TR.1254" is not supported.

The following shows that omitting the optional "utf-8" encoding in a BCP 47 locale makes the CRT default to the associated ANSI codepage. 

    >>> locale.setlocale(locale.LC_CTYPE, 'tr_TR')
    'tr_TR'
    >>> ucrt.___lc_codepage_func()
    1254

C ___lc_codepage_func() queries the codepage of the current locale. We can directly query this codepage for a BCP 47 locale via GetLocaleInfoEx:

    >>> cpstr = (ctypes.c_wchar * 6)()
    >>> kernel32.GetLocaleInfoEx('tr-TR',
    ...     LOCALE_IDEFAULTANSICODEPAGE, cpstr, len(cpstr))
    5
    >>> cpstr.value
    '1254'

If the result is '0', it's a Unicode-only locale (e.g. 'hi-IN' -- Hindi, India). Recent versions of the CRT use UTF-8 (codepage 65001) for Unicode-only locales:

    >>> locale.setlocale(locale.LC_CTYPE, 'hi-IN')
    'hi-IN'
    >>> ucrt.___lc_codepage_func()
    65001

Here are some example locale tuples that should be supported, given that the CRT continues to support full English locale names and non-standard abbreviations, in addition to the new BCP 47 names:

    ('tr', None)
    ('tr_TR', None)
    ('tr_Latn_TR, None)
    ('tr_TR', 'utf-8')
    
    ('trk_TUR', '1254')
    ('Turkish_Turkey', '1254')

The return value from C setlocale can be normalized to replace hyphen delimiters with underscores, and "utf8" can be normalized as "utf-8". If it's a BCP 47 locale that has no encoding, GetLocaleInfoEx can be called to query the ANSI codepage. UTF-8 can be assumed if it's a Unicode-only locale. 

As to prefixing a codepage with 'cp', we don't really need to do this. We have aliases defined for most, such as '1252' -> 'cp1252'. But if the 'cp' prefix does get added, then the locale module should at least know to remove it when building a locale name from a tuple.

[1] https://tools.ietf.org/rfc/bcp/bcp47.txt
msg350491 - (view) Author: Tim Golden (tim.golden) * (Python committer) Date: 2019-08-26 06:52
Thanks, Eryk. Your explanation is as clear as always. But my question is, then: why is my machine failing this test [the only one which uses this two-part locale] and not the buildbots or (presumably) any other Windows developer?
msg350510 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2019-08-26 08:16
> But my question is, then: why is my machine failing this test [the 
> only one which uses this two-part locale] and not the buildbots or 
> (presumably) any other Windows developer?

test_getsetlocale_issue1813 fails for me as well. I can't imagine how setlocale(LC_CTYPE, "tr_TR.ISO8859-9") would succeed with recent versions of the Universal CRT in Windows. It parses "tr_TR" as a BCP 47 locale name, which only supports UTF-8 (e.g. "tr_TR.utf-8") and implicit ANSI (e.g. "tr_TR"). Plus "ISO8859-9" in general isn't a supported encoding of the form ".<codepage>", ".ACP" (ANSI), ".utf8", or ".utf-8". 

With the old CRT (2.x and <=3.4) and older versions of the Universal CRT, the initial locale.setlocale(locale.LC_CTYPE 'tr_TR') call fails as an unsupported locale, so the test is skipped: 

    test_getsetlocale_issue1813 (__main__.TestMiscellaneous) ... skipped 'test needs Turkish locale'

The old CRT only supports "trk_TUR", "trk_Turkey", "turkish_TUR", and "turkish_Turkey".
msg350548 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-08-26 16:33
So is the fix here to update locale._build_localename to check something like this?

if encoding is None:
    return language
elif sys.platform == 'win32' and encoding not in {'utf8', 'utf-8'}:
    return language
else:
    return language + '.' + encoding
msg350549 - (view) Author: Tim Golden (tim.golden) * (Python committer) Date: 2019-08-26 17:19
I agree that that could be a fix. And certainly, if it turns out that this could never have (recently) worked as Eryk is suggesting, then let's go for it.

But I still have this uneasy feeling that it's not failing on the buildbots and I can't see any sign of a skipped test in the test stdio. I just wonder whether there's something else at play here.
msg350559 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-08-26 18:26
I pushed a custom buildbot run that only runs this test in verbose mode, and it looks like the test is being skipped some other way?

https://buildbot.python.org/all/#/builders/48/builds/36
https://buildbot.python.org/all/#/builders/42/builds/54

I don't see any evidence there that it's running at all, though I do on my own machine.

Perhaps one of the other buildbot settings causes it to run in a different order and something skips the entire class? I haven't dug in enough to figure that out yet.
msg350568 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2019-08-26 20:51
We get into trouble with test_getsetlocale_issue1813 because normalize() maps "tr_TR" (supported) to "tr_TR.ISO8859-9" (not supported).

    >>> locale.normalize('tr_TR')
    'tr_TR.ISO8859-9'

We should skip normalize() in Windows. It's based on a POSIX locale_alias mapping that can only cause problems. The work for normalizing locale names in Windows is best handled inline in _build_localename and _parse_localename.

For the old long form, C setlocale always returns the codepage encoding (e.g. "Turkish_Turkey.1254") or "utf8", so that's simple to parse. For BCP 47 locales, the encoding is either "utf8" or "utf-8", or nothing at all. For the latter, there's an implied legacy ANSI encoding. This is used by the CRT wherever we depend on byte strings, such as in time.strftime:

mojibake:

    >>> locale.setlocale(locale.LC_CTYPE, 'en_GB')
    'en_GB'
    >>> time.strftime("\u0100")
    'A'

correct:

    >>> locale.setlocale(locale.LC_CTYPE, 'en_GB.utf-8')
    'en_GB.utf-8'
    >>> time.strftime("\u0100")
    'Ā'

(We should switch back to using wcsftime if possible.)

The implicit BCP-47 case can be parsed as `None` -- e.g. ("tr_TR", None). However, it might be useful to support getting the ANSI codepage via GetLocaleInfoEx [1]. A high-level function in locale could internally call _locale.getlocaleinfo(locale_name, LOCALE_IDEFAULTANSICODEPAGE). This would return a string such as "1254". or "0" for a Unicode-only language. 

For _build_localename, we can't simply limit the encoding to UTF-8. We need to support the old long/abbreviated forms (e.g. "trk_TUR", "turkish_Turkey") in addition to the newer BCP 47 locale names. In the old form we have to support the following encodings:

    * codepage encodings, with an optional "cp" prefix that has 
      to be stripped, e.g. ("trk_TUR", "cp1254") -> "trk_TUR.1254"
    * "ACP" in upper case only -- for the ANSI codepage of the 
      language
    * "utf8" (mixed case) and "utf-8" (mixed case)

(The CRT documentation says "OEM" should also be supported, but it's not.)

A locale name can also omit the language in the old form -- e.g. (None, "ACP") or (None, "cp1254"). The CRT uses the current language in this case. This is discouraged because the result may be nonsense.

[1] https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getlocaleinfoex
msg350569 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-08-26 21:23
Oh yeah, that locale_alias table is useless on Windows :(

But at least the function is documented in such a way that we can change it: "The returned locale code is formatted for use with :func:`setlocale`."

Alternatively, we could make setlocale() do its own normalization step on Windows and ignore (or otherwise validate/reject) the encoding.

None of that explains why the test doesn't seem to run at all on the buildbots though.
msg350571 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2019-08-26 21:43
> None of that explains why the test doesn't seem to run at all on the 
> buildbots though.

Are the buildbots using an older version of UCRT? BCP 47 locales used to strictly require a hyphen as the delimiter (e.g. 'tr-TR') instead of underscore (e.g. 'tr_TR'). Supporting underscore and UTF-8 are relatively recent additions that aren't documented yet. Even WINAPI GetLocaleInfoEx supports underscore as the delimiter now, which is also undocumented behavior.
msg350573 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-08-26 21:46
> test_getsetlocale_issue1813 (test.test_locale.TestMiscellaneous) ... skipped 'test needs Turkish locale'

Yeah, looks like they're failing that part of the test. I'll run them again with the hyphen.
msg350574 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-08-26 21:51
Oh man, this is too broken for me to think about today...

If someone feels like writing a Windows-specific normalize() function to totally replace the Unix one, feel free, but it looks like we won't be able to get away with anything less. The "easy" change breaks a variety of other tests.
msg350598 - (view) Author: Tim Golden (tim.golden) * (Python committer) Date: 2019-08-27 04:59
This feels like one of those changes where what's in place is clearly flawed but any change seems like it'll break stuff which people have had in place for years.

I'll try to look at a least-breaking change but I'm honestly not sure what that would look like.
msg350820 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2019-08-29 19:47
Here's some additional background information for work on this issue.

A Unix locale identifier has the following form:

    "language[_territory][.codeset][@modifier]"
        | "POSIX"
        | "C"
        | ""
        | NULL

(X/Open Portability Guide, Issue 4, 1992 -- aka XPG4)

Some systems also implement "C.UTF-8". 

The language and territory should use ISO 639 and ISO 3166 alpha-2 codes. The "@" modifier may indicate an alternate script such as "sr_RS@latin" or an alternate currency such as "de_DE@euro". For the optional codeset, IANA publishes the following table of character sets:

http://www.iana.org/assignments/character-sets/character-sets.xhtml

In Debian Linux, the available encodings are defined by mapping files in "/usr/share/i18n/charmaps". But encodings can't be arbitrarily used in locales at run time. A locale has to be generated (see "/etc/locale.gen") before it's available. 

A Windows (not ucrt) locale name has the following form:

    "ISO639Language[-ISO15924Script][-ISO3166Region][SubTag][_SortOrder]"
        | ""                      | LOCALE_NAME_INVARIANT
        | "!x-sys-default-locale" | LOCALE_NAME_SYSTEM_DEFAULT
        | NULL                    | LOCALE_NAME_USER_DEFAULT

The invariant locale provides stable data. The system and user default locales vary according to the Control Panel "Region" settings.

A locale name is based on BCP 47 language tags, with the form "<language>-<script>-<region>"(e.g. "en-Latn-GB"), for which the script and region codes are optional. The language is an ISO 639 alpha-2 or alpha-3 code, with alpha-2 preferred. The script is an initial-uppercase ISO 15924 code. The region is an ISO 3166-1 alpha-2 or numeric-3 code, with alpha-2 preferred. 

As specified, the sort-order code should be delimited by an underscore, but Windows 10 (maybe older versions also?) accepts a hyphen instead. Here's a list of the sort-order codes that I've seen:

    * mathan - Math Alphanumerics       ( x-IV_mathan)
    * phoneb - Phone Book               (de-DE_phoneb)
    * modern - Modern                   (ka-GE_modern)
    * tradnl - Traditional              (es-ES_tradnl)
    * technl - Technical                (hu-HU_technl)
    * radstr - Radical/Stroke           (ja-JP_radstr)
    * stroke - Stroke Count             (zh-CN_stroke)
    * pronun - Pronunciation (Bopomofo) (zh-TW_pronun)

One final note of interest about Windows locales is that the user-interface language has been functionally isolated from the locale. The display language is handled by the Multilinugual User Interface (MUI) API, which depends on .mui files in locale-named subdirectories of a binary, such as "kernel32.dll" -> "en-US\kernel32.dll.mui". Windows 10 has an option to configure the user locale to match the preferred display language. This helps to keep the two in sync, but they're still functionally independent.

The Universal CRT (ucrt) in Windows supports the following syntax for a locale identifier:

    "ISO639Language[-ISO15924Script][-ISO3166Region][.utf8|.utf-8]"
        | "ISO639Language[-ISO15924Script][-ISO3166Region][SubTag][_SortOrder]"
        | "language[_region][.codepage|.utf8|.utf-8]"
        | ".codepage" | ".utf8" | ".utf-8"
        | "C"
        | ""
        | NULL

NULL is used with setlocale to query the current value of a category. The empty string "" is the current-user locale. "C" is a minimal locale. For LC_CTYPE, "C" uses Latin-1, but for LC_TIME it uses the system ANSI codepage (possibly multi-byte), which can lead to mojibake. The "POSIX" locale is not supported, nor is "C.UTF-8". 

Note that UTF-8 support is relatively new, as is the ability to set the encoding without also specifying a region (e.g. "english.utf8").

Recent versions of ucrt extend BCP-47 support in a couple of ways. Underscore is allowed in addition to hyphen as the tag delimiter (e.g "en_GB" instead of "en-GB"), and specifying UTF-8 as the encoding (and only UTF-8) is supported. If UTF-8 isn't specified, internally the locale defaults to the language's ANSI codepage. ucrt has to parse BCP 47 locales manually if they include an encoding, and also in some cases when underscore is used. Currently this fails to handle a sort-order tag, so we can't use, for example, "de_DE_phoneb.utf8".
msg350823 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2019-08-29 19:53
If normalize() is implemented for Windows, then the tests should be split out into POSIX and Windows versions. Currently, most of the tests in NormalizeTest are not checking a result that's properly normalized for ucrt.

A useful implementation of locale.normalize should allow a script to use ("en_US", "iso8859_1") in Windows without having to know that Latin-1 is Windows codepage 28591, or that ucrt requires a classic locale name if the encoding isn't UTF-8. The required result for setlocale() is "English_United States.28591". 

As far as aliases are concerned, at a minimum, we need to map "posix" and "c" to "C". We can also support "C.UTF-8" as "en_US.UTF-8". Do we need to support the Unix locale_alias mappings from X.org? If so, I suppose we could use a double mapping. First try the Unix locale_alias mapping. Then try that result in a windows_locale_alias mapping that includes additional mappings from Unix to Windows. For example: 

    sr_CS.UTF-8          -> sr_Cyrl_CS.UTF-8
    sr_CS.UTF-8@latin    -> sr_Latn_CS.UTF-8
    ca_ES.UTF-8@valencia -> ca_ES_valencia.UTF-8

Note that the last one doesn't currently work. "ca-ES-valencia" is a valid Windows locale name for the Valencian variant of Catalan (ca), which lacks an ISO 639 code of its own since it's officially (and somewhat controversially) designated as a dialect of Catalan. This is an unusual case that has a subtag after the region, which ucrt's manual BCP-47 parsing cannot handle. (It tries to parse "ES" as the script and "valencia" as an ISO 3166-1 country code.)

After mapping aliases, if the result still has "@" in it, normalize() should fail. We don't know what the "@" modifier means.

Otherwise, split the locale name and encoding parts. If the encoding isn't UTF-8, try to map it to a codepage. For this we need a  windows_codepage_alias dict that maps IANA official and Python-specific encoding names to Windows codepages. Next, check the locale name via WINAPI IsValidLocaleName. If it's not valid, try replacing underscore with hyphen and check again. Otherwise assume it's a classic ucrt locale name. (It may not be valid, but implementing all of the work ucrt does to parse a classic locale name is too much I think.) If it's a valid Windows locale name, and we have a codepage encoding, then try to translate it as a classic ucrt locale name. This requires two WINAPI GetLocaleInfoEx calls to look up the English versions of the language and country name.
msg351788 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-09-11 09:51
FYI I just closed issue10466 as a duplicate (even though that one's been around longer, this issue has more relevant information on it).
History
Date User Action Args
2019-09-30 18:34:32vstinnersettitle: test_locale failing -> [Windows] locale.getdefaultlocale() issues on Windows: test_locale.test_getsetlocale_issue1813()
2019-09-11 20:55:53guy.lintonsetnosy: + guy.linton
2019-09-11 09:51:28steve.dowersetmessages: + msg351788
2019-09-11 09:49:22steve.dowerlinkissue10466 superseder
2019-08-29 19:53:50eryksunsetmessages: + msg350823
2019-08-29 19:47:44eryksunsetmessages: + msg350820
2019-08-27 04:59:23tim.goldensetmessages: + msg350598
2019-08-26 21:51:59steve.dowersetmessages: + msg350574
2019-08-26 21:46:48steve.dowersetmessages: + msg350573
2019-08-26 21:43:13eryksunsetmessages: + msg350571
2019-08-26 21:23:53steve.dowersetmessages: + msg350569
2019-08-26 20:51:32eryksunsetmessages: + msg350568
2019-08-26 18:26:48steve.dowersetmessages: + msg350559
2019-08-26 17:19:31tim.goldensetmessages: + msg350549
2019-08-26 16:33:11steve.dowersetmessages: + msg350548
2019-08-26 08:16:44eryksunsetmessages: + msg350510
2019-08-26 06:52:36tim.goldensetmessages: + msg350491
2019-08-26 05:11:17eryksunsetmessages: + msg350485
2019-08-25 19:34:05tim.goldensetmessages: + msg350471
2019-08-25 19:29:13tim.goldensetmessages: + msg350470
2019-08-25 17:45:32xtreaksetnosy: + eryksun, paul.moore, xtreak, zach.ware, steve.dower
components: + Windows
2019-08-25 17:33:21tim.goldencreate