classification
Title: Non-ascii Windows locale names
Type: behavior
Stage:
Components: Unicode, Windows
Versions: Python 3.8, Python 3.7
process
Status: open
Resolution: third party
Dependencies:
Superseder:
Assigned To:
Nosy List: eryksun, ezio.melotti, paul.moore, serhiy.storchaka, steve.dower, tim.golden, vidartf, vstinner, zach.ware
Priority: normal
Keywords:

Created on 2016-01-06 14:28 by vidartf, last changed 2019-02-11 14:58 by steve.dower.

Messages (8)
msg257608 - Author: Vidar Fauske (vidartf) Date: 2016-01-06 14:28
The Norwegian locale on Windows has the honor of having the only locale name with a non-ASCII character ('Norwegian Bokmål_Norway', see e.g. https://wiki.postgresql.org/wiki/Changes_To_Norwegian_Locale). Python 3 does not seem to handle this properly, as the following session demonstrates:

>python
Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:44:40) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_TIME, 'swedish')
'Swedish_Sweden.1252'
>>> loc_sw = locale.getlocale(locale.LC_TIME)
>>> locale.setlocale(locale.LC_TIME, 'norwegian')
'Norwegian Bokmål_Norway.1252'
>>> loc_no = locale.getlocale(locale.LC_TIME)
>>> locale.setlocale(locale.LC_TIME, loc_sw)
'Swedish_Sweden.1252'
>>> locale.setlocale(locale.LC_TIME, loc_no)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\prog\WinPython-64bit-3.4.3.7\python-3.4.3.amd64\lib\locale.py", line 593, in setlocale
    return _setlocale(category, locale)
locale.Error: unsupported locale setting


As the session shows, setting the locale by its ASCII alias ('norwegian') works, but once the locale has been set to Norwegian, the value returned by getlocale is rejected when passed back to setlocale.

Following the example of PostgreSQL in the link above, I suggest changing the behavior of locale.getlocale to alias 'Norwegian Bokmål_Norway.1252' as 'Norwegian_Norway.1252', which is pure ASCII and so avoids the problem.
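
A minimal sketch of the aliasing suggested above, written as a standalone helper rather than as a patch to locale.getlocale. The function name and the alias table are hypothetical; the table contains only the one mapping proposed here, and whether the CRT accepts the aliased name may depend on the CRT version.

```python
# Hypothetical alias table: non-ASCII Windows locale names -> ASCII equivalents.
_ASCII_LOCALE_ALIASES = {
    "Norwegian Bokmål_Norway": "Norwegian_Norway",
}


def ascii_locale_name(name):
    """Return `name` with a known non-ASCII language_country part aliased.

    `name` is a locale string as returned by setlocale,
    e.g. 'Swedish_Sweden.1252'. Unknown names are returned unchanged.
    """
    # Split off the optional '.codeset' suffix so only the name is looked up.
    base, dot, codeset = name.partition(".")
    base = _ASCII_LOCALE_ALIASES.get(base, base)
    return base + dot + codeset
```

With this helper, ascii_locale_name('Norwegian Bokmål_Norway.1252') gives 'Norwegian_Norway.1252', while names that are already ASCII pass through untouched.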
msg257611 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2016-01-06 15:28
This may be related to issue25812. Python assumes that the locale settings in all categories use the same encoding (the one set by LC_CTYPE). First try setting LC_CTYPE to an ASCII-named locale with the 1252 codepage.
msg257612 - Author: Eryk Sun (eryksun) (Python triager) Date: 2016-01-06 15:54
PyLocale_setlocale in Modules/_localemodule.c is incorrectly passing the locale as a UTF-8 string ("z") instead of using the codepage of the current locale. 

As you can see below "å" is passed as the UTF-8 string "\xc3\xa5":

    >>> locale._setlocale(locale.LC_TIME, 'Norwegian Bokmål_Norway.1252')
    Breakpoint 0 hit
    MSVCR100!setlocale:
    00000000`56d23d14 48895c2408      mov     qword ptr [rsp+8],rbx
                                              ss:00000000`004af800=
                                              0000000002ad2a68
    0:000> db @rdx l0n29
    00000000`02808910  4e 6f 72 77 65 67 69 61-
                       6e 20 42 6f 6b 6d c3 a5  Norwegian Bokm..
    00000000`02808920  6c 5f 4e 6f 72 77 61 79-
                       2e 31 32 35 32           l_Norway.1252

The CRT's setlocale works fine when passed the locale string encoded with codepage 1252:

    >>> import ctypes
    >>> msvcr100 = ctypes.CDLL('msvcr100')
    >>> msvcr100.setlocale.restype = ctypes.c_char_p
    >>> loc_no = 'Norwegian Bokmål_Norway.1252'.encode('1252')
    >>> msvcr100.setlocale(locale.LC_TIME, loc_no)
    b'Norwegian Bokm\xe5l_Norway.1252'
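
The difference between the two byte strings can be reproduced without the CRT. The following is only an illustration using str.encode on the locale name quoted above; the variable names are made up for the example.

```python
name = "Norwegian Bokmål_Norway.1252"

utf8_bytes = name.encode("utf-8")   # what the "z" format code passes
ansi_bytes = name.encode("cp1252")  # what the CRT accepts on a 1252 system

# 'å' takes two bytes (c3 a5) in UTF-8 but one byte (e5) in codepage 1252.
print(utf8_bytes)  # b'Norwegian Bokm\xc3\xa5l_Norway.1252'
print(ansi_bytes)  # b'Norwegian Bokm\xe5l_Norway.1252'
```

The UTF-8 form is 29 bytes long, which matches the length dumped in the debugger above.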
msg257613 - Author: STINNER Victor (vstinner) (Python committer) Date: 2016-01-06 16:13
> PyLocale_setlocale in Modules/_localemodule.c is incorrectly passing the locale as a UTF-8 string ("z") instead of using the codepage of the current locale. 

Do you mean that the function must encode the locale name to the *ANSI codepage*?
msg257625 - Author: Eryk Sun (eryksun) (Python triager) Date: 2016-01-06 18:09
Yes, it's ANSI. I should have said "system locale" instead of "current locale". To find the requested locale, the CRT function __get_qualified_locale calls EnumSystemLocalesA. The passed callback calls GetLocaleInfoA for each enumerated locale to get the country (SENGLISHCOUNTRYNAME) and language (SENGLISHLANGUAGENAME).
msg257808 - Author: Eryk Sun (eryksun) (Python triager) Date: 2016-01-09 09:20
The issue isn't quite the same for 3.5+. The new CRT uses Windows Vista locale APIs. In this case it uses LOCALE_SENGLISHLANGUAGENAME instead of the old LOCALE_SENGLANGUAGE. This maps "Norwegian" to simply "Norwegian" instead of "Norwegian Bokmål":

    >>> locale.setlocale(locale.LC_TIME, 'norwegian')
    'Norwegian_Norway.1252'

The "Norwegian Bokmål" language name has to be requested explicitly to see the same problem:

    >>> try: locale.setlocale(locale.LC_TIME, 'Norwegian Bokmål')
    ... except Exception as e: print(e)
    ...
    unsupported locale setting

The fix for 3.4 would be to encode the locale string using PyUnicode_AsMBCSString (ANSI). It's too late, however, since 3.4 is no longer getting bug fixes.

For 3.5+, setlocale could either switch to using _wsetlocale on Windows or call setlocale with the string encoded via Py_EncodeLocale (wcstombs). Encoding the string via wcstombs is required because the new CRT roundtrips the conversion via mbstowcs before forwarding the call to _wsetlocale. This means that success depends on the current LC_CTYPE, unless Python switches to calling _wsetlocale directly.

As a workaround for 3.5+, the new CRT also supports RFC 4646 language-tag locales when running on Vista or later. For example, "Norwegian Bokmål" is simply "nb".

Language-tag locales differ from POSIX locales. Superficially, they use "-" instead of "_" as the delimiter. More importantly, they don't allow explicitly setting the codeset. Instead of a .codeset, they use ISO 15924 script codes. Specifying a script may select a different ANSI codepage. It depends on whether there's an NLS definition for the language-script combination. For example, Bosnian can be written using either Latin or Cyrillic. Thus the "bs-BA" and "bs-Latn-BA" locales use the Central Europe codepage 1250, but "bs-Cyrl-BA" uses the Cyrillic codepage 1251. On the other hand, "en-Cyrl-US" still uses the Latin codepage 1252.
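
To make the structural difference concrete, here is a rough sketch that splits a simple language tag into its parts. The function name is hypothetical, and it handles only the language[-Script][-REGION] shape used in the examples above; it is not a full RFC 4646 parser and it does not look up codepages.

```python
def split_language_tag(tag):
    """Split a simple tag such as 'bs-Cyrl-BA' into (language, script, region).

    Simplified assumption: the script, if present, is a four-letter
    alphabetic subtag (ISO 15924) that comes before the region.
    Missing parts are returned as None.
    """
    parts = tag.split("-")
    language, script, region = parts[0], None, None
    for part in parts[1:]:
        if script is None and region is None and len(part) == 4 and part.isalpha():
            script = part
        elif region is None:
            region = part
    return language, script, region


print(split_language_tag("nb-NO"))       # ('nb', None, 'NO')
print(split_language_tag("bs-Cyrl-BA"))  # ('bs', 'Cyrl', 'BA')
```

Note that there is no slot for a codeset: the codepage follows from the language and script, as described above.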

As a separate issue, language-tag locales break the parsing in locale.getlocale:

    >>> locale.setlocale(locale.LC_TIME, 'nb-NO')
    'nb-NO'
    >>> try: locale.getlocale(locale.LC_TIME)
    ... except Exception as e: print(e)
    ...
    unknown locale: nb-NO

    >>> locale.setlocale(locale.LC_CTYPE, 'bs-Cyrl-BA')
    'bs-Cyrl-BA'
    >>> try: locale.getlocale(locale.LC_CTYPE)
    ... except Exception as e: print(e)
    ...
    unknown locale: bs-Cyrl-BA
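
Code that only needs to read the current setting can sidestep the parsing in getlocale by querying setlocale directly, since calling setlocale without a locale argument returns the current setting as an unparsed string. A small sketch, with a hypothetical helper name:

```python
import locale


def current_locale_name(category=locale.LC_CTYPE):
    """Return the raw locale string for `category` without parsing it.

    Calling setlocale with no locale argument only queries the current
    setting, so names that getlocale cannot parse (such as 'nb-NO')
    are returned as-is instead of raising.
    """
    return locale.setlocale(category)


print(current_locale_name(locale.LC_TIME))
```

The returned string is not split into a (language, encoding) pair, so callers that need the encoding have to obtain it some other way.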
msg335204 - Author: Vidar Fauske (vidartf) Date: 2019-02-11 09:58
This issue can still be triggered for Python 3.7 by the following line (running on a Windows machine with a Norwegian locale as default):

locale.setlocale(locale.LC_ALL, locale.getdefaultlocale())
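
For comparison, passing an empty string asks the C runtime to select the user's default locale itself, so no locale name is built in Python and passed back. This is standard setlocale behavior, but whether it avoids the failure on a Norwegian Windows system is an assumption that has not been verified here.

```python
import locale

try:
    # '' selects the user's default locale; Python does not construct a name.
    chosen = locale.setlocale(locale.LC_ALL, "")
except locale.Error:
    # The environment names a locale the C library does not support.
    chosen = None

print(chosen)
```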
msg335227 - Author: Steve Dower (steve.dower) (Python committer) Date: 2019-02-11 14:58
We should switch to _wsetlocale, or else come up with a more sensible mapping that works across platforms (like we already have for encodings).

I suspect the latter requires proper design and discussion, so it's worth doing the first part immediately.
History
Date                 User              Action  Args
2019-02-11 14:58:33  steve.dower       set     messages: + msg335227; versions: + Python 3.8, - Python 3.5, Python 3.6
2019-02-11 09:58:19  vidartf           set     messages: + msg335204; versions: + Python 3.7
2016-01-09 09:20:05  eryksun           set     resolution: third party; messages: + msg257808
2016-01-08 18:35:13  terry.reedy       set     versions: + Python 3.5, Python 3.6, - Python 3.4
2016-01-06 18:09:08  eryksun           set     messages: + msg257625
2016-01-06 16:13:46  vstinner          set     messages: + msg257613
2016-01-06 15:54:55  eryksun           set     nosy: + eryksun; messages: + msg257612
2016-01-06 15:28:16  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg257611
2016-01-06 14:28:10  vidartf           create