Author eryksun
Recipients eryksun, paul.moore, steve.dower, terry.reedy, tim.golden, vstinner, zach.ware
Date 2020-02-11.02:10:58
Message-id <1581387058.63.0.676943854325.issue38324@roundup.psfhosted.org>
In-reply-to
Content
> On Windows 10 (version 1903), ANSI code page 1252, OEM code page 437, 
> LC_CTYPE locale "French_France.1252"

The CRT default locale (i.e. the empty locale "") uses the user locale, which is the "Format" value on the Region->Formats tab. It does not use the system locale from the Region->Administrative tab. 

The default locale normally uses the user locale's ANSI codepage, as returned by GetLocaleInfoEx(LOCALE_NAME_USER_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE, ...). But if the active codepage of the process is UTF-8, then GetACP(), GetOEMCP(), and setlocale(LC_CTYPE, "") all use UTF-8 (i.e. CP_UTF8, 65001). The active codepage can be set to UTF-8 either at the system-locale level or in the application manifest. For example, with the active codepage set in the manifest:

    C:\>python.utf8.exe -q

    >>> from locale import setlocale, LC_CTYPE
    >>> setlocale(LC_CTYPE, "")
    'English_Canada.utf8'

    >>> import ctypes
    >>> kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    >>> kernel32.GetACP()
    65001
    >>> kernel32.GetOEMCP()
    65001
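
The GetLocaleInfoEx query mentioned above can be sketched with ctypes. This is a minimal sketch, not a tested recipe: the helper name user_ansi_codepage is hypothetical, the constant values are the ones from the Windows SDK headers, and the function simply returns None on non-Windows platforms.

```python
import ctypes
import sys

# Values from the Windows SDK headers (winnls.h).
LOCALE_IDEFAULTANSICODEPAGE = 0x00001004
LOCALE_RETURN_NUMBER = 0x20000000


def user_ansi_codepage():
    """Return the user locale's ANSI codepage, or None when not on Windows.

    Hypothetical helper; passing NULL as the locale name means
    LOCALE_NAME_USER_DEFAULT.
    """
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetLocaleInfoEx.argtypes = (
        ctypes.c_wchar_p, ctypes.c_uint32, ctypes.c_void_p, ctypes.c_int)
    kernel32.GetLocaleInfoEx.restype = ctypes.c_int
    value = ctypes.c_uint32()
    # With LOCALE_RETURN_NUMBER the buffer holds a DWORD, and its size is
    # given in WCHAR units.
    size = ctypes.sizeof(value) // ctypes.sizeof(ctypes.c_wchar)
    if not kernel32.GetLocaleInfoEx(
            None, LOCALE_IDEFAULTANSICODEPAGE | LOCALE_RETURN_NUMBER,
            ctypes.byref(value), size):
        raise ctypes.WinError(ctypes.get_last_error())
    return value.value


print(user_ansi_codepage())
```

In a process whose active codepage is UTF-8, this value can differ from what GetACP() reports, which is the distinction the paragraph above draws.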

A default locale name can also specify the codepage to use. It could be UTF-8, a particular codepage, ".ACP" (ANSI), or ".OCP" (OEM). "ACP" and "OCP" have to be in upper case. For example:

    >>> setlocale(LC_CTYPE, '.utf8')
    'English_Canada.utf8'
    >>> setlocale(LC_CTYPE, '.437')
    'English_Canada.437'

    >>> setlocale(LC_CTYPE, ".ACP")
    'English_Canada.1252'
    >>> setlocale(LC_CTYPE, ".OCP")
    'English_Canada.850'
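
The accepted codepage-only forms can be summarised in a small classifier. This is only an illustration of the rules listed above; the function name and the returned labels are hypothetical, and accepting ".utf-8" in any letter case for this form is an assumption carried over from the BCP-47 case.

```python
def classify_codepage_spec(spec):
    """Classify a codepage-only locale string such as '.utf8' or '.437'.

    Returns 'utf8', 'ansi', 'oem', 'codepage', or None if the string does
    not look like one of the forms described above.
    """
    if not spec.startswith("."):
        return None
    encoding = spec[1:]
    if encoding.lower() in ("utf8", "utf-8"):  # assumed case-insensitive
        return "utf8"
    if encoding == "ACP":  # must be upper case
        return "ansi"
    if encoding == "OCP":  # must be upper case
        return "oem"
    if encoding.isascii() and encoding.isdigit():
        return "codepage"
    return None


for spec in (".utf8", ".437", ".ACP", ".OCP", ".acp"):
    print(spec, classify_codepage_spec(spec))
```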

Otherwise, if you provide a known locale (by full name, by three-letter abbreviation, or from the small set of locale aliases), then setlocale queries any missing values from the NLS database. 

One snag in the road is the set of Unicode-only locales, such as "Hindi_India". Querying the ANSI and OEM codepages for a Unicode-only locale respectively returns CP_ACP (0) and CP_OEMCP (1). It used to be that the CRT would end up using the system locale for these cases. But recently ucrt has switched to using UTF-8 for these cases. For example:

    >>> setlocale(LC_CTYPE, "Hindi_India")
    'Hindi_India.utf8'

That brings us to the case of modern Windows BCP-47 locale names, which usually lack an explicit encoding. For example:

    >>> setlocale(LC_CTYPE, "hi_IN")
    'hi_IN'

The current CRT codepage can be queried via ___lc_codepage_func (note the three leading underscores):

    >>> import ctypes; ucrt = ctypes.CDLL('ucrtbase', use_errno=True)
    >>> ucrt.___lc_codepage_func()
    65001

With the exception of Unicode-only locales, using a modern name without an encoding defaults to the named locale's ANSI codepage. For example:

    >>> setlocale(LC_CTYPE, "en_CA")
    'en_CA'
    >>> ucrt.___lc_codepage_func()
    1252
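
For experimenting with several names in a row, the two steps above can be wrapped so that the previous LC_CTYPE setting is restored afterwards. This is a rough sketch: the helper name crt_ctype_codepage is hypothetical, and it assumes, as the session above does, that loading 'ucrtbase' gives access to the CRT state that setlocale modifies. On non-Windows platforms it returns None.

```python
import ctypes
import locale
import sys


def crt_ctype_codepage(name):
    """Set LC_CTYPE to `name` and return (resolved_name, crt_codepage).

    The previous LC_CTYPE setting is restored before returning.
    Returns None when not running on Windows.
    """
    if sys.platform != "win32":
        return None
    ucrt = ctypes.CDLL("ucrtbase")
    ucrt.___lc_codepage_func.restype = ctypes.c_uint
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, no change
    try:
        resolved = locale.setlocale(locale.LC_CTYPE, name)
        return resolved, ucrt.___lc_codepage_func()
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


print(crt_ctype_codepage("en_CA"))
```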

The only encoding allowed in BCP-47 locale names is ".utf8" or ".utf-8" (case insensitive):

    >>> setlocale(LC_CTYPE, "fr_FR.utf8")
    'fr_FR.utf8'
    >>> setlocale(LC_CTYPE, "fr_FR.UTF-8")
    'fr_FR.UTF-8'

No other encoding is allowed with this form. For example:

    >>> try: setlocale(LC_CTYPE, "fr_FR.ACP")
    ... except Exception as e: print(e)
    ...
    unsupported locale setting
    >>> try: setlocale(LC_CTYPE, "fr_FR.1252")
    ... except Exception as e: print(e)
    ...
    unsupported locale setting
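
The rule for short names can be checked up front instead of catching the error. This is a minimal sketch with a hypothetical function name; it assumes the caller already knows the name is a short BCP-47 style name rather than a long name such as "French_France".

```python
def short_name_encoding_ok(name):
    """Return True if a short locale name carries no encoding or only UTF-8.

    Mirrors the rule above: ".utf8" and ".utf-8" are accepted in any
    letter case, and every other encoding suffix is rejected.
    """
    _, dot, encoding = name.partition(".")
    if not dot:
        return True  # no explicit encoding at all
    return encoding.lower() in ("utf8", "utf-8")


for name in ("fr_FR", "fr_FR.utf8", "fr_FR.UTF-8", "fr_FR.ACP", "fr_FR.1252"):
    print(name, short_name_encoding_ok(name))
```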

As to the "tr_TR" locale bug, Python's locale module is broken on Windows because it assumes that POSIX locale names are directly supported. A significant redesign is required to connect the dots.

    >>> from locale import getlocale
    >>> setlocale(LC_CTYPE, 'tr_TR')
    'tr_TR'
    >>> ucrt.___lc_codepage_func()
    1254

    >>> getlocale(LC_CTYPE)
    ('tr_TR', 'ISO8859-9')

Codepage 1254 is similar to ISO8859-9, except, in typical fashion, Microsoft assigned most of the upper control range 0x80-0x9F to an assortment of characters it deemed useful, such as the Euro symbol "€". The exact codepage needs to be queried via ___lc_codepage_func() and returned as ('tr_TR', 'cp1254'). 
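
The difference between the two encodings can be inspected with Python's own codecs, independent of Windows. The helper name differing_bytes is just for illustration; undefined byte values are decoded with errors="replace" so that they also show up as differences.

```python
def differing_bytes(encoding_a, encoding_b):
    """Return the byte values that decode differently in the two encodings."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        # Undefined bytes decode to U+FFFD instead of raising an error.
        if raw.decode(encoding_a, "replace") != raw.decode(encoding_b, "replace"):
            diffs.append(value)
    return diffs


diffs = differing_bytes("cp1254", "iso8859-9")
print([hex(value) for value in diffs])
print(repr(b"\x80".decode("cp1254")), repr(b"\x80".decode("iso8859-9")))
```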

Conversely, setlocale() needs to know that this BCP-47 name does not support an explicit encoding, unless it's "utf8". If the given codepage, or an associated alias, doesn't match the locale's ANSI codepage, then the locale name has to be expanded to the full name "Turkish_Turkey". The long name allows specifying an arbitrary codepage. 

For example, say we have ('tr_TR', 'ISO8859-7'), i.e. Greek with Turkish locale rules. This transforms to the closest approximation ('tr_TR', '1253'). When setlocale queries the OS, it will find that the ANSI codepage is actually 1254, so it cannot use "tr_TR" or "tr-TR". It needs to expand to the long form:

    >>> setlocale(LC_CTYPE, 'Turkish_Turkey.1253')
    'Turkish_Turkey.1253'
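
The name selection described above could look roughly like this. It is a sketch only: the function name and the three lookup tables are hypothetical and contain just the values quoted in this message, whereas a real implementation would have to query the OS for the locale's ANSI codepage and long name.

```python
# Hypothetical tables, filled only with the values used in the examples above.
ANSI_CODEPAGE = {"tr_TR": 1254}
LONG_NAME = {"tr_TR": "Turkish_Turkey"}
ENCODING_TO_CODEPAGE = {"ISO8859-9": 1254, "ISO8859-7": 1253, "cp1254": 1254}


def windows_locale_name(short_name, encoding=None):
    """Pick a locale string that the CRT accepts for (short_name, encoding)."""
    if encoding is None:
        return short_name
    if encoding.lower() in ("utf8", "utf-8"):
        # The only encoding a short name may carry.
        return f"{short_name}.utf8"
    codepage = ENCODING_TO_CODEPAGE[encoding]
    if codepage == ANSI_CODEPAGE[short_name]:
        # Matches the locale's ANSI codepage, so the bare name is enough.
        return short_name
    # Any other codepage requires the long form.
    return f"{LONG_NAME[short_name]}.{codepage}"


print(windows_locale_name("tr_TR", "ISO8859-9"))
print(windows_locale_name("tr_TR", "ISO8859-7"))
```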