Author eryksun
Recipients AndersMunch, eryksun, lemburg, paul.moore, steve.dower, swt2c, tim.golden, zach.ware
Date 2021-02-23.17:59:57
Message-id <1614103197.58.0.562115514326.issue43115@roundup.psfhosted.org>
In-reply-to
Content
> All getlocale is used for in _strptime.py is comparing the value 
> returned to the previous value returned.

Which is why _strptime should be calling setlocale(LC_TIME), the same as the calendar module. That's not to say that getlocale() and normalize() don't need to be fixed; they do. But returning None for the encoding when there's no codeset, while it works for a few cases, leaves many other cases unaddressed.
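
A minimal sketch of that idea, assuming the only goal is to notice that LC_TIME changed between calls. Calling setlocale() with just a category queries the current setting without changing or parsing it. The helper names and the module-level cache are hypothetical, not what _strptime does today:

```python
import locale

_cached_lc_time = None  # hypothetical cache of the last LC_TIME string seen


def current_lc_time():
    """Return the raw LC_TIME locale string; setlocale(category) only queries."""
    return locale.setlocale(locale.LC_TIME)


def lc_time_changed():
    """Report whether LC_TIME differs from the previous call, then remember it."""
    global _cached_lc_time
    now = current_lc_time()
    changed = now != _cached_lc_time
    _cached_lc_time = now
    return changed
```

Because the raw string is only compared and never parsed, none of the getlocale() parsing problems come into play.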

For example, normalize() and getlocale() are often wrong when an encoding is guessed, such as normalize('en_US') -> 'en_US.ISO8859-1' and normalize('ja_JP') -> 'ja_JP.eucJP'. The encoding is wrong, plus no encoding except UTF-8 is allowed in a BCP-47 locale, so setlocale() will fail.
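
To see which encoding gets guessed for a bare language tag, something like the following can be run on any platform. The helper name guessed_encoding is made up for illustration; it only calls the public locale.normalize():

```python
import locale


def guessed_encoding(name):
    """Return the encoding part that normalize() attaches to `name`, or None."""
    normalized = locale.normalize(name)
    _, _, encoding = normalized.partition(".")
    return encoding or None


for name in ("en_US", "ja_JP"):
    print(name, "->", locale.normalize(name), "| guessed:", guessed_encoding(name))
```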

In all but four cases, classic ucrt "language_country.codepage" locales such as "Japanese_Japan.932" are parsed in a 'benignly' incorrect way (i.e. not as RFC 1766 language tags), which at least roundtrips with setlocale(). It simply splits out the codeset, e.g. ('Japanese_Japan', '932'). The four misbehaving cases are actually the ones for which getlocale() works as documented because the locale_alias mapping has an entry for them.

    * "French_France.1252" -> ('fr_FR', 'cp1252')
    * "German_Germany.1252" -> ('de_DE', 'cp1252')
    * "Portuguese_Brazil.1252" -> ('pt_BR', 'cp1252')
    * "Spanish_Spain.1252" -> ('es_ES', 'cp1252')
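
The 'benign' splitting for the other cases can be approximated as follows. This is a simplified stand-in written for illustration, not the actual _parse_localename() code, and the function name is hypothetical:

```python
def split_codeset(localename):
    """Split 'language_country.codepage' at the first dot, with no alias lookup."""
    name, dot, codeset = localename.partition(".")
    if not dot:
        # No codeset present; this sketch reports None for the encoding.
        return localename, None
    return name, codeset


print(split_codeset("Japanese_Japan.932"))
```

Since nothing is translated, joining the two parts with a dot gives back the original string, which is why these cases roundtrip.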

The problem is that the parsed tuples don't roundtrip because normalize() keeps the encoding, complete with the 'cp' prefix, and only UTF-8 is allowed in a BCP-47 locale. For example:

    >>> locale.setlocale(locale.LC_CTYPE, 'French_France.1252')
    'French_France.1252'
    >>> locale.getlocale()
    ('fr_FR', 'cp1252')
    >>> locale.normalize(locale._build_localename(locale.getlocale()))
    'fr_FR.cp1252'
    >>> try: locale.setlocale(locale.LC_CTYPE, locale.getlocale())
    ... except locale.Error as e: print(e)
    ...
    unsupported locale setting
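
A small helper along these lines can check the roundtrip for whatever locale is currently set. It is only a sketch: the name getlocale_roundtrips is hypothetical, and it restores the original setting from the raw string that setlocale() reported:

```python
import locale


def getlocale_roundtrips(category=locale.LC_CTYPE):
    """Return True if setlocale() accepts what getlocale() reports for `category`."""
    saved = locale.setlocale(category)  # query only; remembers the raw string
    try:
        parsed = locale.getlocale(category)
        locale.setlocale(category, parsed)
        return True
    except (locale.Error, ValueError):
        # locale.Error: setlocale() rejected the rebuilt name.
        # ValueError: getlocale() could not parse the current locale.
        return False
    finally:
        locale.setlocale(category, saved)


print(getlocale_roundtrips())
```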

I suppose normalize() could be special cased in Windows to look for a BCP-47 locale and omit the encoding if it's not UTF-8. I suppose _parse_localename() could be special cased to use None for the encoding if there's no codeset. But this just leaves me feeling unsettled and disappointed that we could be doing a better job of providing the documented behavior of getlocale() and normalize() by implementing them separately for Windows using the tools that the OS provides.
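
For concreteness, the first special case could look roughly like this. It is a string-level sketch under the assumption that the input is already a "language_country[.encoding]" name; the function name is hypothetical, modifiers such as "@euro" are ignored, and it does not check what ucrt actually accepts:

```python
def drop_non_utf8_encoding(localename):
    """Strip the encoding from a locale name unless the encoding is UTF-8."""
    name, dot, encoding = localename.partition(".")
    if dot and encoding.lower().replace("-", "") != "utf8":
        return name
    return localename


print(drop_non_utf8_encoding("fr_FR.cp1252"))
print(drop_non_utf8_encoding("fr_FR.UTF-8"))
```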

FYI, I've commented on this problem across a few issues, including bpo-20088 and bpo-23425 in early 2015, and then extensively in bpo-37945 in mid 2019. Plus my latest comments in msg387256 in this issue. The latter suggestions could be combined with something like the mapping that's generated by the code in msg235937 in bpo-23425, in order to parse a classic ucrt locale string such as "Japanese_Japan.932" properly as ("ja_JP", "cp932"), and then build and normalize it back as "Japanese_Japan.932".
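
As a rough illustration of that last suggestion, here is a sketch with a tiny hand-written table. The table, the function names, and the "cp" prefix handling are all assumptions made for this example; a real mapping would be generated from the OS, as in msg235937, rather than typed in by hand:

```python
# Hypothetical excerpt of a ucrt-name -> language-tag table.
UCRT_TO_TAG = {
    "Japanese_Japan": "ja_JP",
    "French_France": "fr_FR",
    "German_Germany": "de_DE",
}
TAG_TO_UCRT = {tag: name for name, tag in UCRT_TO_TAG.items()}


def parse_ucrt(localename):
    """Parse 'Japanese_Japan.932' into ('ja_JP', 'cp932')."""
    name, _, codepage = localename.partition(".")
    lang = UCRT_TO_TAG.get(name, name)
    if codepage.isdigit():
        return lang, "cp" + codepage
    return lang, codepage or None


def build_ucrt(parsed):
    """Build ('ja_JP', 'cp932') back into 'Japanese_Japan.932'."""
    lang, encoding = parsed
    name = TAG_TO_UCRT.get(lang, lang)
    if not encoding:
        return name
    if encoding.lower().startswith("cp") and encoding[2:].isdigit():
        encoding = encoding[2:]  # ucrt expects the bare code page number
    return f"{name}.{encoding}"


parsed = parse_ucrt("Japanese_Japan.932")
print(parsed, "->", build_ucrt(parsed))
```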