Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Windows] test_locale.TestMiscellaneous.test_getsetlocale_issue1813() fails #82126

Closed
tjguk opened this issue Aug 25, 2019 · 38 comments
Closed
Assignees
Labels
3.8 only security fixes 3.9 only security fixes 3.10 only security fixes OS-windows stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@tjguk
Copy link
Member

tjguk commented Aug 25, 2019

BPO 37945
Nosy @terryjreedy, @pfmoore, @db3l, @vstinner, @tjguk, @jkloth, @ned-deily, @methane, @ambv, @zware, @serhiy-storchaka, @eryksun, @zooba, @miss-islington, @tirkarthi
PRs
  • bpo-37945: Fix test_locale.test_getsetlocale_issue1813() #25110
  • [3.9] bpo-37945: Fix test_locale.test_getsetlocale_issue1813() (GH-25110) #25112
  • [3.8] bpo-37945: Fix test_locale.test_getsetlocale_issue1813() (GH-25110) #25113
  • Files
  • pythoninfo.txt: python -mtest.pythoninfo
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/tjguk'
    closed_at = <Date 2021-03-31.12:08:34.105>
    created_at = <Date 2019-08-25.17:33:21.926>
    labels = ['type-bug', '3.8', '3.9', '3.10', 'library', 'OS-windows']
    title = '[Windows] test_locale.TestMiscellaneous.test_getsetlocale_issue1813() fails'
    updated_at = <Date 2021-03-31.22:47:06.662>
    user = 'https://github.com/tjguk'

    bugs.python.org fields:

    activity = <Date 2021-03-31.22:47:06.662>
    actor = 'vstinner'
    assignee = 'tim.golden'
    closed = True
    closed_date = <Date 2021-03-31.12:08:34.105>
    closer = 'vstinner'
    components = ['Library (Lib)', 'Windows']
    creation = <Date 2019-08-25.17:33:21.926>
    creator = 'tim.golden'
    dependencies = []
    files = ['48561']
    hgrepos = []
    issue_num = 37945
    keywords = ['patch']
    message_count = 38.0
    messages = ['350466', '350470', '350471', '350485', '350491', '350510', '350548', '350549', '350559', '350568', '350569', '350571', '350573', '350574', '350598', '350820', '350823', '351788', '361804', '361806', '379068', '382739', '389005', '389657', '389661', '389784', '389839', '389867', '389883', '389884', '389885', '389886', '389888', '389890', '389891', '389892', '389938', '389940']
    nosy_count = 16.0
    nosy_names = ['terry.reedy', 'paul.moore', 'db3l', 'vstinner', 'tim.golden', 'jkloth', 'ned.deily', 'methane', 'guy.linton', 'lukasz.langa', 'zach.ware', 'serhiy.storchaka', 'eryksun', 'steve.dower', 'miss-islington', 'xtreak']
    pr_nums = ['25110', '25112', '25113']
    priority = None
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue37945'
    versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']

    @tjguk
    Copy link
    Member Author

    tjguk commented Aug 25, 2019

    On a Win10 machine I'm consistently seeing test_locale (and test__locale) fail. I'll attach pythoninfo.

    ======================================================================
    ERROR: test_getsetlocale_issue1813 (test.test_locale.TestMiscellaneous)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "C:\Users\tim\work-in-progress\cpython\lib\test\test_locale.py", line 531, in test_getsetlocale_issue1813
        locale.setlocale(locale.LC_CTYPE, loc)
      File "C:\Users\tim\work-in-progress\cpython\lib\locale.py", line 604, in setlocale
        return _setlocale(category, locale)
    locale.Error: unsupported locale setting

    @tjguk tjguk added the 3.9 only security fixes label Aug 25, 2019
    @tjguk tjguk self-assigned this Aug 25, 2019
    @tjguk tjguk added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Aug 25, 2019
    @tjguk
    Copy link
    Member Author

    tjguk commented Aug 25, 2019

    Ok; so basically this doesn't work:

    <code>
    import locale
    locale.setlocale(locale.LC_CTYPE, locale.getdefaultlocale())
    </code>

    It gives "locale.Error: unsupported locale setting" which comes from https://github.com/python/cpython/blob/master/Modules/_localemodule.c#L107

    (For locale.getdefaultlocale() you could substitute locale.getlocale() or simply ("en_GB", "cp1252")). On my machine it raises that exception on Python 2.7.15, 3.6.6 and on master.

    Interestingly, none of the other tests in test_locale appear to exercise the 2-tuple 2nd param to setlocale. When you call setlocale and it returns the previous setting, it's a single string, eg "en_GB" etc. Passing that back in works. But when you call getlocale, it returns the 2-tuple, eg ("en_GB", "cp1252"). But all the other tests use the setlocale-returns-current trick for their setup/teardown.

    I've quickly tested on 3.5 on Linux and the 2-tuple version works ok. I assume it's working on buildbots or we'd see the Turkish test failing every time. So is there something different about my C runtime, I wonder?

    @tjguk
    Copy link
    Member Author

    tjguk commented Aug 25, 2019

    Just to save you looking, the code in https://github.com/python/cpython/blob/master/Modules/_localemodule.c#L107 converts the 2-tuple to lang.encoding form so the C module is seeing "en_GB.cp1252"

    @eryksun
    Copy link
    Contributor

    eryksun commented Aug 26, 2019

    local.normalize is generally wrong in Windows. It's meant for POSIX systems. Currently "tr_TR" is parsed as follows:

        >>> locale._parse_localename('tr_TR')
        ('tr_TR', 'ISO8859-9')

    The encoding "ISO8859-9" is meaningless to Windows. Also, the old CRT only ever supported either full language/country names or non-standard abbreviations -- e.g. either "Turkish_Turkey" or "trk_TUR". Having locale.getdefaultlocale() return ISO two-letter codes (e.g. "en_GB") was fundamentally wrong for the old CRT. (2.7 will die with this wart.)

    3.5+ uses the Universal CRT, which does support standard ISO codes, but only in BCP 47 [1] locale names of the following form:

    language           ISO 639
    ["-" script]       ISO 15924
    ["-" region]       ISO 3166-1
    

    BCP 47 locale names have been preferred by Windows for the past 13 years, since Vista was released. Windows extends BCP 47 with a non-standard sort-order field (e.g. "de-Latn-DE_phoneb" is the German language with Latin script in the region of Germany with phone-book sort order). Another departure from strict BCP 47 in Windows is allowing underscore to be used as the delimiter instead of hyphen.

    In a concession to existing C code, the Universal CRT also supports an encoding suffix in BCP 47 locales, but this can only be either ".utf-8" or ".utf8". (Windows itself does not support specifying an encoding in a locale name, but it's Unicode anyway.) No other encoding is allowed. If ".utf-8" isn't specified, a BCP 47 locale defaults to the locale's ANSI codepage. However, there's no way to convey this in the locale name itself. Also, if a locale is Unicode only (e.g. Hindi), the CRT implicitly uses UTF-8 even without the ".utf-8" suffix.

    The following are valid BCP 47 locale names in the CRT: "tr", "tr.utf-8", "tr-TR", "tr_TR", "tr_TR.utf8", or "tr-Latn-TR.utf-8". But note that "tr_TR.1254" is not supported.

    The following shows that omitting the optional "utf-8" encoding in a BCP 47 locale makes the CRT default to the associated ANSI codepage.

        >>> locale.setlocale(locale.LC_CTYPE, 'tr_TR')
        'tr_TR'
        >>> ucrt.___lc_codepage_func()
        1254

    C ___lc_codepage_func() queries the codepage of the current locale. We can directly query this codepage for a BCP 47 locale via GetLocaleInfoEx:

        >>> cpstr = (ctypes.c_wchar * 6)()
        >>> kernel32.GetLocaleInfoEx('tr-TR',
        ...     LOCALE_IDEFAULTANSICODEPAGE, cpstr, len(cpstr))
        5
        >>> cpstr.value
        '1254'

    If the result is '0', it's a Unicode-only locale (e.g. 'hi-IN' -- Hindi, India). Recent versions of the CRT use UTF-8 (codepage 65001) for Unicode-only locales:

        >>> locale.setlocale(locale.LC_CTYPE, 'hi-IN')
        'hi-IN'
        >>> ucrt.___lc_codepage_func()
        65001

    Here are some example locale tuples that should be supported, given that the CRT continues to support full English locale names and non-standard abbreviations, in addition to the new BCP 47 names:

    ('tr', None)
    ('tr_TR', None)
    ('tr_Latn_TR, None)
    ('tr_TR', 'utf-8')
    
    ('trk_TUR', '1254')
    ('Turkish_Turkey', '1254')
    

    The return value from C setlocale can be normalized to replace hyphen delimiters with underscores, and "utf8" can be normalized as "utf-8". If it's a BCP 47 locale that has no encoding, GetLocaleInfoEx can be called to query the ANSI codepage. UTF-8 can be assumed if it's a Unicode-only locale.

    As to prefixing a codepage with 'cp', we don't really need to do this. We have aliases defined for most, such as '1252' -> 'cp1252'. But if the 'cp' prefix does get added, then the locale module should at least know to remove it when building a locale name from a tuple.

    [1] https://tools.ietf.org/rfc/bcp/bcp47.txt

    @tjguk
    Copy link
    Member Author

    tjguk commented Aug 26, 2019

    Thanks, Eryk. Your explanation is as clear as always. But my question is, then: why is my machine failing this test [the only one which uses this two-part locale] and not the buildbots or (presumably) any other Windows developer?

    @eryksun
    Copy link
    Contributor

    eryksun commented Aug 26, 2019

    But my question is, then: why is my machine failing this test [the
    only one which uses this two-part locale] and not the buildbots or
    (presumably) any other Windows developer?

    test_getsetlocale_issue1813 fails for me as well. I can't imagine how setlocale(LC_CTYPE, "tr_TR.ISO8859-9") would succeed with recent versions of the Universal CRT in Windows. It parses "tr_TR" as a BCP 47 locale name, which only supports UTF-8 (e.g. "tr_TR.utf-8") and implicit ANSI (e.g. "tr_TR"). Plus "ISO8859-9" in general isn't a supported encoding of the form ".<codepage>", ".ACP" (ANSI), ".utf8", or ".utf-8".

    With the old CRT (2.x and <=3.4) and older versions of the Universal CRT, the initial locale.setlocale(locale.LC_CTYPE 'tr_TR') call fails as an unsupported locale, so the test is skipped:

    test_getsetlocale_issue1813 (__main__.TestMiscellaneous) ... skipped 'test needs Turkish locale'
    

    The old CRT only supports "trk_TUR", "trk_Turkey", "turkish_TUR", and "turkish_Turkey".

    @zooba
    Copy link
    Member

    zooba commented Aug 26, 2019

    So is the fix here to update locale._build_localename to check something like this?

    if encoding is None:
        return language
    elif sys.platform == 'win32' and encoding not in {'utf8', 'utf-8'}:
        return language
    else:
        return language + '.' + encoding

    @tjguk
    Copy link
    Member Author

    tjguk commented Aug 26, 2019

    I agree that that could be a fix. And certainly, if it turns out that this could never have (recently) worked as Eryk is suggesting, then let's go for it.

    But I still have this uneasy feeling that it's not failing on the buildbots and I can't see any sign of a skipped test in the test stdio. I just wonder whether there's something else at play here.

    @zooba
    Copy link
    Member

    zooba commented Aug 26, 2019

    I pushed a custom buildbot run that only runs this test in verbose mode, and it looks like the test is being skipped some other way?

    https://buildbot.python.org/all/#/builders/48/builds/36
    https://buildbot.python.org/all/#/builders/42/builds/54

    I don't see any evidence there that it's running at all, though I do on my own machine.

    Perhaps one of the other buildbot settings causes it to run in a different order and something skips the entire class? I haven't dug in enough to figure that out yet.

    @eryksun
    Copy link
    Contributor

    eryksun commented Aug 26, 2019

    We get into trouble with test_getsetlocale_issue1813 because normalize() maps "tr_TR" (supported) to "tr_TR.ISO8859-9" (not supported).

        >>> locale.normalize('tr_TR')
        'tr_TR.ISO8859-9'

    We should skip normalize() in Windows. It's based on a POSIX locale_alias mapping that can only cause problems. The work for normalizing locale names in Windows is best handled inline in _build_localename and _parse_localename.

    For the old long form, C setlocale always returns the codepage encoding (e.g. "Turkish_Turkey.1254") or "utf8", so that's simple to parse. For BCP 47 locales, the encoding is either "utf8" or "utf-8", or nothing at all. For the latter, there's an implied legacy ANSI encoding. This is used by the CRT wherever we depend on byte strings, such as in time.strftime:

    mojibake:

        >>> locale.setlocale(locale.LC_CTYPE, 'en_GB')
        'en_GB'
        >>> time.strftime("\u0100")
        'A'

    correct:

        >>> locale.setlocale(locale.LC_CTYPE, 'en_GB.utf-8')
        'en_GB.utf-8'
        >>> time.strftime("\u0100")
        'Ā'

    (We should switch back to using wcsftime if possible.)

    The implicit BCP-47 case can be parsed as None -- e.g. ("tr_TR", None). However, it might be useful to support getting the ANSI codepage via GetLocaleInfoEx [1]. A high-level function in locale could internally call _locale.getlocaleinfo(locale_name, LOCALE_IDEFAULTANSICODEPAGE). This would return a string such as "1254". or "0" for a Unicode-only language.

    For _build_localename, we can't simply limit the encoding to UTF-8. We need to support the old long/abbreviated forms (e.g. "trk_TUR", "turkish_Turkey") in addition to the newer BCP 47 locale names. In the old form we have to support the following encodings:

    * codepage encodings, with an optional "cp" prefix that has 
      to be stripped, e.g. ("trk_TUR", "cp1254") -> "trk_TUR.1254"
    * "ACP" in upper case only -- for the ANSI codepage of the 
      language
    * "utf8" (mixed case) and "utf-8" (mixed case)
    

    (The CRT documentation says "OEM" should also be supported, but it's not.)

    A locale name can also omit the language in the old form -- e.g. (None, "ACP") or (None, "cp1254"). The CRT uses the current language in this case. This is discouraged because the result may be nonsense.

    [1] https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getlocaleinfoex

    @zooba
    Copy link
    Member

    zooba commented Aug 26, 2019

    Oh yeah, that locale_alias table is useless on Windows :(

    But at least the function is documented in such a way that we can change it: "The returned locale code is formatted for use with :func:`setlocale`."

    Alternatively, we could make setlocale() do its own normalization step on Windows and ignore (or otherwise validate/reject) the encoding.

    None of that explains why the test doesn't seem to run at all on the buildbots though.

    @eryksun
    Copy link
    Contributor

    eryksun commented Aug 26, 2019

    None of that explains why the test doesn't seem to run at all on the
    buildbots though.

    Are the buildbots using an older version of UCRT? BCP 47 locales used to strictly require a hyphen as the delimiter (e.g. 'tr-TR') instead of underscore (e.g. 'tr_TR'). Supporting underscore and UTF-8 are relatively recent additions that aren't documented yet. Even WINAPI GetLocaleInfoEx supports underscore as the delimiter now, which is also undocumented behavior.

    @zooba
    Copy link
    Member

    zooba commented Aug 26, 2019

    test_getsetlocale_issue1813 (test.test_locale.TestMiscellaneous) ... skipped 'test needs Turkish locale'

    Yeah, looks like they're failing that part of the test. I'll run them again with the hyphen.

    @zooba
    Copy link
    Member

    zooba commented Aug 26, 2019

    Oh man, this is too broken for me to think about today...

    If someone feels like writing a Windows-specific normalize() function to totally replace the Unix one, feel free, but it looks like we won't be able to get away with anything less. The "easy" change breaks a variety of other tests.

    @tjguk
    Copy link
    Member Author

    tjguk commented Aug 27, 2019

    This feels like one of those changes where what's in place is clearly flawed but any change seems like it'll break stuff which people have had in place for years.

    I'll try to look at a least-breaking change but I'm honestly not sure what that would look like.

    @eryksun
    Copy link
    Contributor

    eryksun commented Aug 29, 2019

    Here's some additional background information for work on this issue.

    A Unix locale identifier has the following form:

    "language[_territory][.codeset][@modifier]"
        | "POSIX"
        | "C"
        | ""
        | NULL
    

    (X/Open Portability Guide, Issue 4, 1992 -- aka XPG4)

    Some systems also implement "C.UTF-8".

    The language and territory should use ISO 639 and ISO 3166 alpha-2 codes. The "@" modifier may indicate an alternate script such as "sr_RS@latin" or an alternate currency such as "de_DE@euro". For the optional codeset, IANA publishes the following table of character sets:

    http://www.iana.org/assignments/character-sets/character-sets.xhtml

    In Debian Linux, the available encodings are defined by mapping files in "/usr/share/i18n/charmaps". But encodings can't be arbitrarily used in locales at run time. A locale has to be generated (see "/etc/locale.gen") before it's available.

    A Windows (not ucrt) locale name has the following form:

    "ISO639Language[-ISO15924Script][-ISO3166Region][SubTag][_SortOrder]"
        | ""                      | LOCALE_NAME_INVARIANT
        | "!x-sys-default-locale" | LOCALE_NAME_SYSTEM_DEFAULT
        | NULL                    | LOCALE_NAME_USER_DEFAULT
    

    The invariant locale provides stable data. The system and user default locales vary according to the Control Panel "Region" settings.

    A locale name is based on BCP 47 language tags, with the form "<language>-<script>-<region>"(e.g. "en-Latn-GB"), for which the script and region codes are optional. The language is an ISO 639 alpha-2 or alpha-3 code, with alpha-2 preferred. The script is an initial-uppercase ISO 15924 code. The region is an ISO 3166-1 alpha-2 or numeric-3 code, with alpha-2 preferred.

    As specified, the sort-order code should be delimited by an underscore, but Windows 10 (maybe older versions also?) accepts a hyphen instead. Here's a list of the sort-order codes that I've seen:

    * mathan - Math Alphanumerics       ( x-IV_mathan)
    * phoneb - Phone Book               (de-DE_phoneb)
    * modern - Modern                   (ka-GE_modern)
    * tradnl - Traditional              (es-ES_tradnl)
    * technl - Technical                (hu-HU_technl)
    * radstr - Radical/Stroke           (ja-JP_radstr)
    * stroke - Stroke Count             (zh-CN_stroke)
    * pronun - Pronunciation (Bopomofo) (zh-TW_pronun)
    

    One final note of interest about Windows locales is that the user-interface language has been functionally isolated from the locale. The display language is handled by the Multilinugual User Interface (MUI) API, which depends on .mui files in locale-named subdirectories of a binary, such as "kernel32.dll" -> "en-US\kernel32.dll.mui". Windows 10 has an option to configure the user locale to match the preferred display language. This helps to keep the two in sync, but they're still functionally independent.

    The Universal CRT (ucrt) in Windows supports the following syntax for a locale identifier:

    "ISO639Language[-ISO15924Script][-ISO3166Region][.utf8|.utf-8]"
        | "ISO639Language[-ISO15924Script][-ISO3166Region][SubTag][_SortOrder]"
        | "language[_region][.codepage|.utf8|.utf-8]"
        | ".codepage" | ".utf8" | ".utf-8"
        | "C"
        | ""
        | NULL
    

    NULL is used with setlocale to query the current value of a category. The empty string "" is the current-user locale. "C" is a minimal locale. For LC_CTYPE, "C" uses Latin-1, but for LC_TIME it uses the system ANSI codepage (possibly multi-byte), which can lead to mojibake. The "POSIX" locale is not supported, nor is "C.UTF-8".

    Note that UTF-8 support is relatively new, as is the ability to set the encoding without also specifying a region (e.g. "english.utf8").

    Recent versions of ucrt extend BCP-47 support in a couple of ways. Underscore is allowed in addition to hyphen as the tag delimiter (e.g "en_GB" instead of "en-GB"), and specifying UTF-8 as the encoding (and only UTF-8) is supported. If UTF-8 isn't specified, internally the locale defaults to the language's ANSI codepage. ucrt has to parse BCP 47 locales manually if they include an encoding, and also in some cases when underscore is used. Currently this fails to handle a sort-order tag, so we can't use, for example, "de_DE_phoneb.utf8".

    @eryksun
    Copy link
    Contributor

    eryksun commented Aug 29, 2019

    If normalize() is implemented for Windows, then the tests should be split out into POSIX and Windows versions. Currently, most of the tests in NormalizeTest are not checking a result that's properly normalized for ucrt.

    A useful implementation of locale.normalize should allow a script to use ("en_US", "iso8859_1") in Windows without having to know that Latin-1 is Windows codepage 28591, or that ucrt requires a classic locale name if the encoding isn't UTF-8. The required result for setlocale() is "English_United States.28591".

    As far as aliases are concerned, at a minimum, we need to map "posix" and "c" to "C". We can also support "C.UTF-8" as "en_US.UTF-8". Do we need to support the Unix locale_alias mappings from X.org? If so, I suppose we could use a double mapping. First try the Unix locale_alias mapping. Then try that result in a windows_locale_alias mapping that includes additional mappings from Unix to Windows. For example:

    sr_CS.UTF-8          -> sr_Cyrl_CS.UTF-8
    sr_CS.UTF-8@latin    -> sr_Latn_CS.UTF-8
    ca_ES.UTF-8@valencia -> ca_ES_valencia.UTF-8
    

    Note that the last one doesn't currently work. "ca-ES-valencia" is a valid Windows locale name for the Valencian variant of Catalan (ca), which lacks an ISO 639 code of its own since it's officially (and somewhat controversially) designated as a dialect of Catalan. This is an unusual case that has a subtag after the region, which ucrt's manual BCP-47 parsing cannot handle. (It tries to parse "ES" as the script and "valencia" as an ISO 3166-1 country code.)

    After mapping aliases, if the result still has "@" in it, normalize() should fail. We don't know what the "@" modifier means.

    Otherwise, split the locale name and encoding parts. If the encoding isn't UTF-8, try to map it to a codepage. For this we need a windows_codepage_alias dict that maps IANA official and Python-specific encoding names to Windows codepages. Next, check the locale name via WINAPI IsValidLocaleName. If it's not valid, try replacing underscore with hyphen and check again. Otherwise assume it's a classic ucrt locale name. (It may not be valid, but implementing all of the work ucrt does to parse a classic locale name is too much I think.) If it's a valid Windows locale name, and we have a codepage encoding, then try to translate it as a classic ucrt locale name. This requires two WINAPI GetLocaleInfoEx calls to look up the English versions of the language and country name.

    @zooba
    Copy link
    Member

    zooba commented Sep 11, 2019

    FYI I just closed bpo-10466 as a duplicate (even though that one's been around longer, this issue has more relevant information on it).

    @vstinner vstinner changed the title test_locale failing [Windows] locale.getdefaultlocale() issues on Windows: test_locale.test_getsetlocale_issue1813() Sep 30, 2019
    @vstinner
    Copy link
    Member

    I marked bpo-38324 as a duplicate of this issue.

    @vstinner
    Copy link
    Member

    For me, the locale.getlocale() function is broken. Python attempts to guess too many things about the current locale. For example, it relies on an hard coded locale.locale_encoding_alias dictionary which seems to come from X11!? In 2020, Wayland replaced X11 and locales are updated frequently. This dictionary makes no sense on Windows. For example, 'ka_ge' is mapped to 'ka_GE.GEORGIAN-ACADEMY': "GEORGIAN-ACADEMY" is not an encoding, what is the purpose of this string?

    I fixed dozens and dozens of locale issues and I never ever used locale.getlocale(). I never understood its purpose nor how it guess the encoding.

    I always use locale.setlocale(category) or locale.setlocale(category, None) which returns a simply string which I can pass back to locale.setlocale(category, string) to restore the locale, when I need to temporarily change a locale.

    @terryjreedy
    Copy link
    Member

    test__locale (bpo-38324) also passed CI and buildbots while failing locally.

    @terryjreedy terryjreedy added the 3.10 only security fixes label Oct 20, 2020
    @terryjreedy
    Copy link
    Member

    This seemingly useless test is the only test failure for me with installed 3.9.1. Why keep it, at least on Windows.

    The failure with "locale.Error: unsupported locale setting" is not limited to Windows. bpo-25191 and duplicate bpo-31636 report the same error on user machines (non-buildbot, as here). #25191proposes the following patch to skip this failure.

    -        locale.setlocale(locale.LC_CTYPE, loc)
    -        self.assertEqual(loc, locale.getlocale(locale.LC_CTYPE))
    +        try:
    +            locale.setlocale(locale.LC_CTYPE, loc)
    +            self.assertEqual(loc, locale.getlocale(locale.LC_CTYPE))
    +        except locale.Error:
    +            # Unsupported locale setting
    +            self.skipTest('unsupported locale setting')

    I believe that this is effectively the same as deleting the test. But if we believe it is being skipped on at least Windows buildbots, then we should do the same at least on user Windows machines. Or, if we could detect user manchines, skip on them.

    @terryjreedy terryjreedy changed the title [Windows] locale.getdefaultlocale() issues on Windows: test_locale.test_getsetlocale_issue1813() test_locale.TestMiscellaneous.test_getsetlocale_issue1813() fails Dec 8, 2020
    @vstinner vstinner changed the title test_locale.TestMiscellaneous.test_getsetlocale_issue1813() fails [Windows] test_locale.TestMiscellaneous.test_getsetlocale_issue1813() fails Dec 14, 2020
    @vstinner
    Copy link
    Member

    "ERROR: test_getsetlocale_issue1813 (test.test_locale.TestMiscellaneous)" fails on the Windows x64 job of GitHub Actions when Python is built in debug mode:
    #24914

    @db3l
    Copy link
    Contributor

    db3l commented Mar 29, 2021

    The test has also begun failing on the Win10 buildbot (after updating to 20H2 from an older 1803).

    @terryjreedy
    Copy link
    Member

    So, can we delete it?

    PR 19781 is for bpo-43510 and is listed here above only because this issue is mentioned.

    @db3l
    Copy link
    Contributor

    db3l commented Mar 30, 2021

    I don't have much of a horse in the race, but since the test has historically been skipped on Windows, and the test hasn't and doesn't work on Windows, modifications to restore the skip behavior seem reasonable to me. The trigger for this issue was Windows adding support for underscore in locale names (like tr_TR) so the test began executing. But it's not a regression or new issue, it's just existing reality becoming exposed.

    The user machine and buildbot discrepancy can be attributed to version differences, as the buildbot hadn't yet received the same underscore locale name support.

    I'd be fine with removing the test entirely - always skipping on a failure just seems pointless. Then again, bpo-1813 created the test for a purpose on other systems, though even back then it appears it was complicated. Leaving the test but skipping known failing systems (I guess at least Windows and OpenBSD) might be slightly less intrusive of a change, assuming the test is still serving a purpose elsewhere.

    Separately, there's a lot of useful/interesting detail here which could inform any eventual normalization changes on Windows, should the underlying issue be deemed worthy of addressing. But that seems like something that could be a distinct operation from clearing up the test issue.

    @zooba
    Copy link
    Member

    zooba commented Mar 30, 2021

    This is now holding up some security releases (due to a couple of CVEs). Can we get the test skipped or fixed asap, please?

    @db3l
    Copy link
    Contributor

    db3l commented Mar 31, 2021

    In lieu of the patch in bpo-25191, what about a pair of skips to deal with the issues at hand without killing the test entirely? I'm including OpenBSD since those issues were closed in favor of this one, and am assuming that skipping there is also appropriate.

    --- a/Lib/test/test_locale.py
    +++ b/Lib/test/test_locale.py
    @@ -552,6 +552,10 @@ def test_setlocale_category(self):
             # crasher from bug python/cpython#51668
             self.assertRaises(locale.Error, locale.setlocale, 12345)
     
    +    @unittest.skipIf(sys.platform == 'win32',
    +                     "Test broken on Windows (issue python/cpython#82126)")
    +    @unittest.skipIf(sys.platform.startswith('openbsd'),
    +                     "Test broken on OpenBSD (issues python/cpython#75817 and python/cpython#69378)")
         def test_getsetlocale_issue1813(self):
             # Issue python/cpython#46138: setting and getting the locale under a Turkish locale
             oldlocale = locale.setlocale(locale.LC_CTYPE)

    @vstinner
    Copy link
    Member

    On Windows 10 build 1903,

    vstinner@WIN C:\vstinner\python\master>python -m test test_locale -m test_getsetlocale_issue1813 -v
    == CPython 3.10.0a6+ (heads/master:ff3c9739bd, Mar 31 2021, 12:43:26) [MSC v.1916 64 bit (AMD64)]
    == Windows-10-10.0.18362-SP0 little-endian
    (...)
    ======================================================================
    ERROR: test_getsetlocale_issue1813 (test.test_locale.TestMiscellaneous)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "C:\vstinner\python\master\lib\test\test_locale.py", line 567, in test_getsetlocale_issue1813
        locale.setlocale(locale.LC_CTYPE, loc)
      File "C:\vstinner\python\master\lib\locale.py", line 610, in setlocale
        return _setlocale(category, locale)
    locale.Error: unsupported locale setting

    It's a bug in the weird locale.getlocale() function which produces a locale name which doesn't exist:

    vstinner@WIN C:\vstinner\python\master>python
    >>> import locale
    >>> locale.setlocale(locale.LC_CTYPE, "tr_TR") 
    'tr_TR'
    >>> locale.setlocale(locale.LC_CTYPE, None)
    'tr_TR'
    >>> locale.getlocale(locale.LC_CTYPE)           
    ('tr_TR', 'ISO8859-9')
    
    
    >>> locale.setlocale(locale.LC_CTYPE, ('tr_TR', 'ISO8859-9'))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\vstinner\python\master\lib\locale.py", line 610, in setlocale
        return _setlocale(category, locale)
    locale.Error: unsupported locale setting
    
    >>> locale.setlocale(locale.LC_CTYPE, 'tr_TR')
    'tr_TR'

    If you use setlocale(LC_CTYPE, None) to get the locale, it works as expected.

    IMO the getlocale() function is dangerous and should be removed: see bpo-43557 "Deprecate getdefaultlocale(), getlocale() and normalize() functions".

    @vstinner
    Copy link
    Member

    I wrote PR 25110 to simply skip the test if setlocale() fails. It fix the issue on Windows (I tested manually, see my comment on my PR), but it should also fix the issue on OpenBSD and any platform where getlocale() returns a locale not accepted by setlocale().

    Again, don't ust getlocale(category) but setlocale(category, None).

    @ambv
    Copy link
    Contributor

    ambv commented Mar 31, 2021

    Yeah, I'm making the change David suggested. It applies to 3.8 as well.

    @ambv ambv added the 3.8 only security fixes label Mar 31, 2021
    @ambv
    Copy link
    Contributor

    ambv commented Mar 31, 2021

    Oh, Victor's solution is fine as well.

    @ambv
    Copy link
    Contributor

    ambv commented Mar 31, 2021

    New changeset f3ab670 by Victor Stinner in branch 'master':
    bpo-37945: Fix test_locale.test_getsetlocale_issue1813() (bpo-25110)
    f3ab670

    @ambv
    Copy link
    Contributor

    ambv commented Mar 31, 2021

    New changeset fabdd25 by Miss Islington (bot) in branch '3.9':
    bpo-37945: Fix test_locale.test_getsetlocale_issue1813() (GH-25110) (GH-25112)
    fabdd25

    @ambv
    Copy link
    Contributor

    ambv commented Mar 31, 2021

    New changeset e143eea by Miss Islington (bot) in branch '3.8':
    bpo-37945: Fix test_locale.test_getsetlocale_issue1813() (GH-25110) (GH-25113)
    e143eea

    @vstinner
    Copy link
    Member

    Ok, the initial issue has been fixed: test_locale pass again on Windows.

    Let's continue the discussion on getlocale() in bpo-43557 "Deprecate getdefaultlocale(), getlocale() and normalize() functions" ;-)

    @terryjreedy
    Copy link
    Member

    Great! For the first time in over 2 years, the test suite passes on a Windows repository build on my machine. I will test installed 3.10 after the next alpha release. (3.10.0a7 has other failures as well.)

    @vstinner
    Copy link
    Member

    Great! For the first time in over 2 years, the test suite passes on a Windows repository build on my machine.

    Nice :-)

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.8 only security fixes 3.9 only security fixes 3.10 only security fixes OS-windows stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    8 participants