Message 257808 - Python tracker

Author	eryksun
Recipients	eryksun, ezio.melotti, paul.moore, serhiy.storchaka, steve.dower, tim.golden, vidartf, vstinner, zach.ware
Date	2016-01-09.09:20:04
Message-id	<1452331205.77.0.181792411864.issue26024@psf.upfronthosting.co.za>

Content

The issue isn't quite the same for 3.5+. The new CRT uses Windows Vista locale APIs. In this case it uses LOCALE_SENGLISHLANGUAGENAME instead of the old LOCALE_SENGLANGUAGE. This maps "Norwegian" to simply "Norwegian" instead of "Norwegian Bokmål":

    >>> locale.setlocale(locale.LC_TIME, 'norwegian')
    'Norwegian_Norway.1252'

The "Norwegian Bokmål" language name has to be requested explicitly to see the same problem:

    >>> try: locale.setlocale(locale.LC_TIME, 'Norwegian Bokmål')
    ... except Exception as e: print(e)
    ...
    unsupported locale setting

The fix for 3.4 would be to encode the locale string using PyUnicode_AsMBCSString (ANSI). It's too late, however, since 3.4 is no longer getting bug fixes.

For 3.5+, setlocale could either switch to using _wsetlocale on Windows or call setlocale with the string encoded via Py_EncodeLocale (wcstombs). Encoding the string via wcstombs is required because the new CRT roundtrips the conversion via mbstowcs before forwarding the call to _wsetlocale. This means that success depends on the current LC_CTYPE, unless Python switches to calling _wsetlocale directly.
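Why the byte encoding matters can be modeled in plain Python. This is only a sketch: `crt_view` is a hypothetical helper, `str.encode` and `bytes.decode` merely stand in for the narrow-to-wide conversion done by the CRT, and the choice of UTF-8 as the mismatched encoding is an assumption made for illustration.

```python
def crt_view(locale_name, sent_encoding, ctype_encoding):
    """Model the wide string obtained by decoding under the LC_CTYPE codepage.

    Hypothetical helper: encode/decode only approximate the CRT's
    mbstowcs step, and both encodings are assumptions.
    """
    return locale_name.encode(sent_encoding).decode(ctype_encoding)


name = "Norwegian Bokmål"

# Bytes encoded and decoded with the same ANSI codepage survive intact.
matched = crt_view(name, "cp1252", "cp1252")

# Bytes sent in a different encoding (UTF-8 assumed here) are garbled,
# so the language name can no longer be matched.
mismatched = crt_view(name, "utf-8", "cp1252")
# mismatched == "Norwegian BokmÃ¥l"
```

Passing a wide string straight to _wsetlocale would remove this dependence, since no narrow-string conversion would be involved.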

As a workaround for 3.5+, the new CRT also supports RFC 4646 language-tag locales when running on Vista or later. For example, "Norwegian Bokmål" is simply "nb".
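A minimal sketch of applying this workaround from Python code, assuming the caller is willing to try several spellings in order. `set_first_supported` is a hypothetical helper, not part of the locale module; which names succeed depends on the platform and CRT.

```python
import locale


def set_first_supported(category, candidates):
    """Try each locale name in order; return the first setlocale result."""
    for name in candidates:
        try:
            return locale.setlocale(category, name)
        except locale.Error:
            continue  # this spelling is not supported here; try the next
    raise locale.Error("no supported locale among %r" % (candidates,))


# Prefer the language tag, then fall back to the default "C" locale.
print(set_first_supported(locale.LC_TIME, ["nb", "C"]))
```

Listing the ASCII-only language tag first sidesteps the encoding problem, because the name contains no characters that depend on LC_CTYPE.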

Language-tag locales differ from POSIX locales. Superficially, they use "-" instead of "_" as the delimiter. More importantly, they don't allow explicitly setting the codeset. Instead of a .codeset, they use ISO 15924 script codes. Specifying a script may select a different ANSI codepage. It depends on whether there's an NLS definition for the language-script combination. For example, Bosnian can be written using either Latin or Cyrillic. Thus the "bs-BA" and "bs-Latn-BA" locales use the Central Europe codepage 1250, but "bs-Cyrl-BA" uses the Cyrillic codepage 1251. On the other hand, "en-Cyrl-US" still uses the Latin codepage 1252.
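The structural difference can be sketched with a small parser. It assumes a simplified subset of the language-tag grammar (language, optional 4-letter script, optional region); `split_language_tag` is a hypothetical helper, and it says nothing about which codepage Windows selects for a given tag.

```python
import re

# Simplified subset of the tag grammar: language[-Script][-REGION].
_TAG_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def split_language_tag(tag):
    """Return (language, script, region); missing parts are None."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError("not a simple language tag: %r" % (tag,))
    return match.group("language"), match.group("script"), match.group("region")


print(split_language_tag("bs-Cyrl-BA"))  # ('bs', 'Cyrl', 'BA')
print(split_language_tag("bs-BA"))       # ('bs', None, 'BA')
print(split_language_tag("nb"))          # ('nb', None, None)
```

A POSIX-style name such as "Norwegian_Norway.1252" is rejected with ValueError, since "_" and ".codeset" are not part of this grammar.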

As a separate issue, language-tag locales break the parsing in locale.getlocale:

    >>> locale.setlocale(locale.LC_TIME, 'nb-NO')
    'nb-NO'
    >>> try: locale.getlocale(locale.LC_TIME)
    ... except Exception as e: print(e)
    ...
    unknown locale: nb-NO

    >>> locale.setlocale(locale.LC_CTYPE, 'bs-Cyrl-BA')
    'bs-Cyrl-BA'
    >>> try: locale.getlocale(locale.LC_CTYPE)
    ... except Exception as e: print(e)
    ...
    unknown locale: bs-Cyrl-BA
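Until the parsing is fixed, calling code can guard against this failure. The following is a sketch only: `getlocale_tolerant` is a hypothetical wrapper, and returning the raw name with a codeset of None is an assumption about what callers would want.

```python
import locale


def getlocale_tolerant(category=locale.LC_CTYPE):
    """Like locale.getlocale, but return (raw_name, None) if parsing fails."""
    try:
        return locale.getlocale(category)
    except ValueError:
        # setlocale without a second argument only queries the setting.
        return locale.setlocale(category), None


print(getlocale_tolerant(locale.LC_TIME))
```

The fallback relies on the fact that calling setlocale with only a category returns the current setting without changing it.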
