Message 350820 - Python tracker


Author	eryksun
Recipients	eryksun, paul.moore, steve.dower, tim.golden, xtreak, zach.ware
Date	2019-08-29.19:47:43
Message-id	<1567108064.16.0.803534269254.issue37945@roundup.psfhosted.org>
In-reply-to

Content

Here's some additional background information for work on this issue.

A Unix locale identifier has the following form:

    "language[_territory][.codeset][@modifier]"
        | "POSIX"
        | "C"
        | ""
        | NULL

(X/Open Portability Guide, Issue 4, 1992 -- aka XPG4)

Some systems also implement "C.UTF-8". 

The language and territory should use ISO 639 and ISO 3166 alpha-2 codes. The "@" modifier may indicate an alternate script such as "sr_RS@latin" or an alternate currency such as "de_DE@euro". For the optional codeset, IANA publishes the following table of character sets:

http://www.iana.org/assignments/character-sets/character-sets.xhtml
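To make the XPG4 form concrete, here is a rough sketch of splitting an identifier into its parts. The names `UnixLocale` and `parse_unix_locale` are made up for illustration, and the regular expression only mirrors the grammar quoted above. It does not check the codes against ISO 639, ISO 3166, or the IANA table, and it does not handle the special values "" and NULL.

```python
import re
from typing import NamedTuple, Optional


class UnixLocale(NamedTuple):
    """Hypothetical container for the parts of an XPG4-style identifier."""
    language: str
    territory: Optional[str]
    codeset: Optional[str]
    modifier: Optional[str]


# language[_territory][.codeset][@modifier]
_UNIX_LOCALE_RE = re.compile(
    r"^(?P<language>[^_.@]+)"
    r"(?:_(?P<territory>[^.@]+))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


def parse_unix_locale(name: str) -> UnixLocale:
    """Split a locale identifier into its parts (syntax only, no validation)."""
    match = _UNIX_LOCALE_RE.match(name)
    if match is None:
        raise ValueError(f"not a locale identifier: {name!r}")
    return UnixLocale(**match.groupdict())


print(parse_unix_locale("sr_RS@latin"))
print(parse_unix_locale("de_DE.UTF-8@euro"))
print(parse_unix_locale("C.UTF-8"))
```

Note that "C" and "POSIX" simply come back as a bare language field here, which is good enough for a sketch but not how a C library treats them.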

In Debian Linux, the available encodings are defined by mapping files in "/usr/share/i18n/charmaps". But encodings can't be arbitrarily used in locales at run time. A locale has to be generated (see "/etc/locale.gen") before it's available. 
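Since a locale may be syntactically fine but not generated on the system, the only reliable check I know of is to try setting it. A minimal sketch, assuming it is acceptable to change the process locale temporarily (setlocale is process-wide and not thread-safe); `locale_is_available` is a hypothetical helper, not something in the standard library:

```python
import locale


def locale_is_available(name: str, category: int = locale.LC_CTYPE) -> bool:
    """Return True if setlocale() accepts `name` for `category`.

    The previous setting is restored before returning.
    """
    saved = locale.setlocale(category, None)  # None queries without changing
    try:
        locale.setlocale(category, name)
    except locale.Error:
        return False
    finally:
        locale.setlocale(category, saved)
    return True


for candidate in ("C", "C.UTF-8", "de_DE.UTF-8", "de_DE@euro"):
    print(candidate, locale_is_available(candidate))
```

The results depend entirely on which locales are installed, so no output is shown.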

A Windows (not ucrt) locale name has the following form:

    "ISO639Language[-ISO15924Script][-ISO3166Region][SubTag][_SortOrder]"
        | ""                      | LOCALE_NAME_INVARIANT
        | "!x-sys-default-locale" | LOCALE_NAME_SYSTEM_DEFAULT
        | NULL                    | LOCALE_NAME_USER_DEFAULT

The invariant locale provides stable data. The system and user default locales vary according to the Control Panel "Region" settings.

A locale name is based on BCP 47 language tags, with the form "<language>-<script>-<region>" (e.g. "en-Latn-GB"), for which the script and region codes are optional. The language is an ISO 639 alpha-2 or alpha-3 code, with alpha-2 preferred. The script is an initial-uppercase ISO 15924 code. The region is an ISO 3166-1 alpha-2 or numeric-3 code, with alpha-2 preferred.

As specified, the sort-order code should be delimited by an underscore, but Windows 10 (maybe older versions also?) accepts a hyphen instead. Here's a list of the sort-order codes that I've seen:

    * mathan - Math Alphanumerics       ( x-IV_mathan)
    * phoneb - Phone Book               (de-DE_phoneb)
    * modern - Modern                   (ka-GE_modern)
    * tradnl - Traditional              (es-ES_tradnl)
    * technl - Technical                (hu-HU_technl)
    * radstr - Radical/Stroke           (ja-JP_radstr)
    * stroke - Stroke Count             (zh-CN_stroke)
    * pronun - Pronunciation (Bopomofo) (zh-TW_pronun)
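For illustration, the Windows name form can be approximated with a regular expression. This is only a sketch under a few assumptions of mine: sort-order codes are six lowercase letters (true of the ones listed above, but I don't know that it's a rule), the unspecified "SubTag" part is ignored, and either underscore or hyphen is accepted before the sort order. `WindowsLocaleName` and `parse_windows_locale_name` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class WindowsLocaleName(NamedTuple):
    """Hypothetical container for the parts of a Windows locale name."""
    language: str
    script: Optional[str]
    region: Optional[str]
    sort_order: Optional[str]


_WINDOWS_LOCALE_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"            # ISO 639 alpha-2 or alpha-3
    r"(?:-(?P<script>[A-Z][a-z]{3}))?"         # ISO 15924, initial uppercase
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"  # alpha-2 or numeric-3
    r"(?:[_-](?P<sort_order>[a-z]{6}))?$"      # assumed: six lowercase letters
)


def parse_windows_locale_name(name: str) -> WindowsLocaleName:
    """Split a Windows locale name into its parts (syntax only)."""
    match = _WINDOWS_LOCALE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized locale name: {name!r}")
    return WindowsLocaleName(**match.groupdict())


print(parse_windows_locale_name("en-Latn-GB"))
print(parse_windows_locale_name("de-DE_phoneb"))
print(parse_windows_locale_name("de-DE-phoneb"))
```

Private-use names such as "x-IV_mathan" and the special values "", "!x-sys-default-locale", and NULL are deliberately out of scope; the sketch raises ValueError for them.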

One final note of interest about Windows locales is that the user-interface language has been functionally isolated from the locale. The display language is handled by the Multilingual User Interface (MUI) API, which depends on .mui files in locale-named subdirectories of a binary, such as "kernel32.dll" -> "en-US\kernel32.dll.mui". Windows 10 has an option to configure the user locale to match the preferred display language. This helps to keep the two in sync, but they're still functionally independent.
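The .mui naming convention in that example can be written down as a small path computation. This only reproduces the directory layout from the example; the actual MUI resource loader does more (language fallback, for one), and `mui_resource_path` is a hypothetical helper.

```python
from pathlib import PureWindowsPath


def mui_resource_path(binary: str, locale_name: str) -> PureWindowsPath:
    """Return <dir>\\<locale_name>\\<binary name>.mui for the given binary."""
    path = PureWindowsPath(binary)
    return path.parent / locale_name / (path.name + ".mui")


print(mui_resource_path("kernel32.dll", "en-US"))
# en-US\kernel32.dll.mui
```

PureWindowsPath is used so the sketch behaves the same on any platform and never touches the file system.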

The Universal CRT (ucrt) in Windows supports the following syntax for a locale identifier:

    "ISO639Language[-ISO15924Script][-ISO3166Region][.utf8|.utf-8]"
        | "ISO639Language[-ISO15924Script][-ISO3166Region][SubTag][_SortOrder]"
        | "language[_region][.codepage|.utf8|.utf-8]"
        | ".codepage" | ".utf8" | ".utf-8"
        | "C"
        | ""
        | NULL

NULL is used with setlocale to query the current value of a category. The empty string "" is the current-user locale. "C" is a minimal locale. For LC_CTYPE, "C" uses Latin-1, but for LC_TIME it uses the system ANSI codepage (possibly multi-byte), which can lead to mojibake. The "POSIX" locale is not supported, nor is "C.UTF-8". 
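In Python the NULL query corresponds to passing None (the default) as the second argument of locale.setlocale. A small sketch that reads the current value of several categories without changing any of them; `current_locale_settings` is a hypothetical helper name:

```python
import locale

_CATEGORIES = ("LC_CTYPE", "LC_COLLATE", "LC_TIME", "LC_MONETARY", "LC_NUMERIC")


def current_locale_settings() -> dict:
    """Map each category name to its current setting (query only)."""
    return {
        name: locale.setlocale(getattr(locale, name), None)
        for name in _CATEGORIES
    }


for name, value in current_locale_settings().items():
    print(f"{name} = {value!r}")
```

The values printed depend on the platform and on how the process was started, so no output is shown.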

Note that UTF-8 support is relatively new, as is the ability to set the encoding without also specifying a region (e.g. "english.utf8").

Recent versions of ucrt extend BCP 47 support in a couple of ways. Underscore is allowed in addition to hyphen as the tag delimiter (e.g. "en_GB" instead of "en-GB"), and specifying UTF-8 as the encoding (and only UTF-8) is supported. If UTF-8 isn't specified, internally the locale defaults to the language's ANSI codepage. ucrt has to parse BCP 47 locales manually if they include an encoding, and also in some cases when underscore is used. Currently this fails to handle a sort-order tag, so we can't use, for example, "de_DE_phoneb.utf8".
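One way to avoid tripping over that limitation is to assemble ucrt locale strings through a helper that refuses the unsupported combination. This is a sketch based only on the behavior described in this message; `build_ucrt_locale` is a hypothetical function, it does no validation of the individual codes, and a later ucrt may lift the restriction.

```python
from typing import Optional


def build_ucrt_locale(
    language: str,
    script: Optional[str] = None,
    region: Optional[str] = None,
    sort_order: Optional[str] = None,
    utf8: bool = False,
) -> str:
    """Assemble a BCP 47 style locale string for ucrt's setlocale."""
    if sort_order and utf8:
        # Per the message above, ucrt's manual parsing of names that include
        # an encoding does not handle a sort-order tag.
        raise ValueError("a sort-order tag cannot be combined with .utf8")
    name = "-".join(part for part in (language, script, region) if part)
    if sort_order:
        name += "_" + sort_order
    if utf8:
        name += ".utf8"
    return name


print(build_ucrt_locale("en", region="GB", utf8=True))
# en-GB.utf8
print(build_ucrt_locale("de", region="DE", sort_order="phoneb"))
# de-DE_phoneb
```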

History
Date	User	Action	Args
2019-08-29 19:47:44	eryksun	set	recipients: + eryksun, paul.moore, tim.golden, zach.ware, steve.dower, xtreak
2019-08-29 19:47:44	eryksun	set	messageid: <1567108064.16.0.803534269254.issue37945@roundup.psfhosted.org>
2019-08-29 19:47:44	eryksun	link	issue37945 messages
2019-08-29 19:47:43	eryksun	create