Message 387588 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	eryksun
Recipients	AndersMunch, eryksun, lemburg, paul.moore, steve.dower, swt2c, tim.golden, zach.ware
Date	2021-02-23.17:59:57
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1614103197.58.0.562115514326.issue43115@roundup.psfhosted.org>
In-reply-to

Content
> All getlocale is used for in _strptime.py is comparing the value > returned to the previous value returned. Which is why _strptime should be calling setlocale(LC_TIME), the same as the calendar module. That's not to say that I don't think getlocale() and normalize() need to be fixed. But returning None for the encoding when there's no codeset, while it works for a few cases, doesn't address many other cases. For example, normalize() and getlocale() will often be wrong in many cases when an encoding is guessed, such as normalize('en_US') -> 'en_US.ISO8859-1' and normalize('ja_JP') -> 'ja_JP.eucJP'. The encoding is wrong, plus no encoding except UTF-8 is allowed in a BCP-47 locale, so setlocale() will fail. In all but four cases, classic ucrt "language_country.codepage" locales such as "Japanese_Japan.932" are parsed in a 'benignly' incorrect way (i.e. not as RFC 1766 language tags), which at least roundtrips with setlocale(). It simply splits out the codeset, e.g. ('Japanese_Japan', '932'). The four misbehaving cases are actually the ones for which getlocale() works as documented because the locale_alias mapping has an entry for them. * "French_France.1252" -> ('fr_FR', 'cp1252') * "German_Germany.1252" -> ('de_DE', 'cp1252') * "Portuguese_Brazil.1252" -> ('pt_BR', 'cp1252') * "Spanish_Spain.1252" -> ('es_ES', 'cp1252') The problem is that the parsed tuples don't roundtrip because normalize() keeps the encoding, complete with the 'cp' prefix, and only UTF-8 is allowed in a BCP-47 locale. For example: >>> locale.setlocale(locale.LC_CTYPE, 'French_France.1252') 'French_France.1252' >>> locale.getlocale() ('fr_FR', 'cp1252') >>> locale.normalize(locale._build_localename(locale.getlocale())) 'fr_FR.cp1252' >>> try: locale.setlocale(locale.LC_CTYPE, locale.getlocale()) ... except locale.Error as e: print(e) ... unsupported locale setting I suppose normalize() could be special cased in Windows to look for a BCP-47 locale and omit the encoding if it's not UTF-8. I suppose _parse_localename() could be special cased to use None for the encoding if there's no codeset. But this just leaves me feeling unsettled and disappointed that we could be doing a better job of providing the documented behavior of getlocale() and normalize() by implementing them separately for Windows using the tools that the OS provides. FYI, I've commented on this problem across a few issues, including bpo-20088 and bpo-23425 in early 2015, and then extensively in bpo-37945 in mid 2019. Plus my latest comments in msg387256 in this issue. The latter suggestions could be combined with something like the mapping that's generated by the code in msg235937 in bpo-23425, in order to parse a classic ucrt locale string such as "Japanese_Japan.932" properly as ("ja_JP", "cp932"), and then build and normalize it back as "Japanese_Japan.932".

> All getlocale is used for in _strptime.py is comparing the value 
> returned to the previous value returned.

Which is why _strptime should be calling setlocale(LC_TIME), the same as the calendar module. That's not to say that I don't think getlocale() and normalize() need to be fixed. But returning None for the encoding when there's no codeset, while it works for a few cases, doesn't address many other cases.

For example, normalize() and getlocale() will often be wrong in many cases when an encoding is guessed, such as normalize('en_US') -> 'en_US.ISO8859-1' and normalize('ja_JP') -> 'ja_JP.eucJP'. The encoding is wrong, plus no encoding except UTF-8 is allowed in a BCP-47 locale, so setlocale() will fail. 

In all but four cases, classic ucrt "language_country.codepage" locales such as "Japanese_Japan.932" are parsed in a 'benignly' incorrect way (i.e. not as RFC 1766 language tags), which at least roundtrips with setlocale(). It simply splits out the codeset, e.g. ('Japanese_Japan', '932'). The four misbehaving cases are actually the ones for which getlocale() works as documented because the locale_alias mapping has an entry for them.

    * "French_France.1252" -> ('fr_FR', 'cp1252')
    * "German_Germany.1252" -> ('de_DE', 'cp1252')
    * "Portuguese_Brazil.1252" -> ('pt_BR', 'cp1252')
    * "Spanish_Spain.1252" -> ('es_ES', 'cp1252')

The problem is that the parsed tuples don't roundtrip because normalize() keeps the encoding, complete with the 'cp' prefix, and only UTF-8 is allowed in a BCP-47 locale. For example:

    >>> locale.setlocale(locale.LC_CTYPE, 'French_France.1252')
    'French_France.1252'
    >>> locale.getlocale()
    ('fr_FR', 'cp1252')
    >>> locale.normalize(locale._build_localename(locale.getlocale()))
    'fr_FR.cp1252'
    >>> try: locale.setlocale(locale.LC_CTYPE, locale.getlocale())
    ... except locale.Error as e: print(e)
    ...
    unsupported locale setting

I suppose normalize() could be special cased in Windows to look for a BCP-47 locale and omit the encoding if it's not UTF-8. I suppose _parse_localename() could be special cased to use None for the encoding if there's no codeset. But this just leaves me feeling unsettled and disappointed that we could be doing a better job of providing the documented behavior of getlocale() and normalize() by implementing them separately for Windows using the tools that the OS provides.

FYI, I've commented on this problem across a few issues, including bpo-20088 and bpo-23425 in early 2015, and then extensively in bpo-37945 in mid 2019. Plus my latest comments in msg387256 in this issue. The latter suggestions could be combined with something like the mapping that's generated by the code in msg235937 in bpo-23425, in order to parse a classic ucrt locale string such as "Japanese_Japan.932" properly as ("ja_JP", "cp932"), and then build and normalize it back as "Japanese_Japan.932".

History
Date	User	Action	Args
2021-02-23 17:59:57	eryksun	set	recipients: + eryksun, lemburg, paul.moore, tim.golden, zach.ware, steve.dower, swt2c, AndersMunch
2021-02-23 17:59:57	eryksun	set	messageid: <1614103197.58.0.562115514326.issue43115@roundup.psfhosted.org>
2021-02-23 17:59:57	eryksun	link	issue43115 messages
2021-02-23 17:59:57	eryksun	create