Message 389195 - Python tracker



Author	eryksun
Recipients	eryksun, lemburg, methane, vstinner
Date	2021-03-20.23:05:24
Message-id	<1616281525.24.0.605949642762.issue43552@roundup.psfhosted.org>
In-reply-to

Content

> In my experience, most applications use the ANSI code page because 
> they use the ANSI flavor of the Windows API.

The default encoding at startup and in the "C" locale wouldn't change. It would only differ from the default if setlocale(LC_CTYPE, locale_name) sets it otherwise. The suggestion is to match the behavior of nl_langinfo(CODESET) in Linux and many other POSIX systems.

When I say the default encoding won't change, I mean that the Universal C Runtime (ucrt) system component uses the process ANSI code page as the default locale encoding for setlocale(LC_CTYPE, ""). This agrees with what Python has always done, but it disagrees with previous versions of the CRT in Windows. Personally, I think it's a misstep because the user locale isn't necessarily compatible with the process code page, but I'm not looking to change this decision. For example, if the user locale is "el_GR" (Greek, Greece) but the process code page is 1252 (Latin) instead of 1253 (Greek), I get the following result in Python 3.4 (VC++ 10) vs Python 3.5 (ucrt):

    >py -3.4 -c "from locale import *; print(setlocale(LC_CTYPE, ''))"
    Greek_Greece.1253

    >py -3.5 -c "from locale import *; print(setlocale(LC_CTYPE, ''))"
    Greek_Greece.1252

The result from VC++ 10 is consistent with the user locale. It's also consistent with multilingual user interface (MUI) text, such as error messages, or at least it should be, because the user locale and user preferred language (i.e. Windows display language) should be consistent. (The control panel dialog to set the user locale in Windows 10 has an option to match the display language, which is the recommended and default setting.) For example, Python uses system error messages that are localized to the user's preferred language:

    >py -c "import os; os.stat('spam')"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    FileNotFoundError: [WinError 2] Δεν είναι δυνατή η εύρεση του καθορισμένου αρχείου από το σύστημα: 'spam'

This example is on a system where the process (system) ANSI code page is 1252 (Latin), which cannot encode the user's preferred Greek text. Thankfully Python 3.6+ uses the console's Unicode API, so neither the console session's output code page nor the process code page gets in the way. On the other hand, if this Greek text is written to a file or piped to a child process using subprocess.Popen(), Python's choice of locale encoding based on the process code page (Latin) is incompatible with Greek text, and thus it's incompatible with the current user's preferred locale and language settings.
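
To make the mismatch concrete, here is a minimal sketch (not part of the original report) that tries to encode a fragment of the Greek error message above with both code pages. It relies only on Python's built-in cp1252 and cp1253 codecs, so it should run on any platform; the helper name `can_encode` is made up for illustration.

```python
import unicodedata

# Fragment of the localized [WinError 2] message, normalized to NFC so each
# accented letter is a single code point that a legacy code page can map.
GREEK_TEXT = unicodedata.normalize(
    "NFC", "Δεν είναι δυνατή η εύρεση του καθορισμένου αρχείου"
)


def can_encode(text, encoding):
    """Return True if `text` can be strictly encoded (illustrative helper)."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# cp1252 (Latin) has no Greek letters; cp1253 (Greek) covers the text.
for codec in ("cp1252", "cp1253"):
    print(codec, "can encode the message:", can_encode(GREEK_TEXT, codec))
```

This is the failure a file write or a pipe to a child process would hit if the locale encoding follows the process code page (1252) rather than the user locale (1253).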

The process ANSI code page from GetACP() has its uses, which are important. It's a system setting that's independent of the current user locale and thus useful when interacting with the legacy system API and as a common encoding for inter-process data exchange when applications do not use Unicode and may be operating in different locales. So if you're writing to a legacy-encoded text file that's shared by multiple users or piping text to an arbitrary program, then using the ANSI code page is probably okay. Though, especially for IPC, there's a good chance that it's wrong, since Windows has never set, let alone enforced, a standard in that case.

Using the process ANSI code page in the "C" locale makes sense to me. 

> What is the use case for using ___lc_codepage()? Is it a different 
> encoding?

I always forget the "_func" suffix in the name; it's ___lc_codepage_func() [1]. The lc_codepage value is the current LC_CTYPE codeset as an integer code page. It's the equivalent of nl_langinfo(CODESET) in POSIX. For UTF-8, the code page is CP_UTF8 (65001), but this gets displayed in locale strings as "UTF-8" (or variants such as "utf8"). It could be the LC_CTYPE encoding of just the current thread, but Python does not enable per-thread locales.
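
As a rough illustration of that equivalence, the sketch below asks the C runtime for the current LC_CTYPE codeset from Python. On POSIX it calls locale.nl_langinfo(locale.CODESET). On Windows it calls ___lc_codepage_func() through ctypes. Several things here are assumptions rather than anything confirmed in this thread: that ucrtbase.dll is the CRT instance Python uses, that the function takes no arguments and returns an unsigned int, and that a result of 0 (CP_ACP) means the process ANSI code page. The helper names are hypothetical.

```python
import ctypes
import locale
import sys

CP_UTF8 = 65001  # Windows code page identifier for UTF-8


def codepage_to_codec(codepage):
    """Map an integer Windows code page to a Python codec name."""
    return "utf-8" if codepage == CP_UTF8 else "cp%d" % codepage


def current_ctype_codeset():
    """Return the codeset of the current LC_CTYPE locale (sketch only)."""
    if sys.platform == "win32":
        # Assumption: Python shares the ucrtbase.dll CRT instance.
        ucrt = ctypes.CDLL("ucrtbase")
        func = getattr(ucrt, "___lc_codepage_func")  # internal CRT export
        func.restype = ctypes.c_uint
        func.argtypes = []
        codepage = func()
        if codepage == 0:
            # Assumption: 0 (CP_ACP) stands for the process ANSI code page.
            codepage = ctypes.WinDLL("kernel32").GetACP()
        return codepage_to_codec(codepage)
    # POSIX: the codeset name comes straight from the C library.
    return locale.nl_langinfo(locale.CODESET)


if __name__ == "__main__":
    print(current_ctype_codeset())
```

Because the Windows branch depends on an internal export, I'd treat it as a way to experiment rather than something to ship.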

The CRT has exported ___lc_codepage_func() since VC++ 7.0 (2002), and before that the current lc_codepage value itself was directly exported as __lc_codepage. However, this triple-dundered function is documented as internal and not recommended for use. That's why the code snippet I showed uses _get_current_locale() with locinfo cast to __crt_locale_data_public *. This takes "public" in the struct name at face value. Anything that's declared public should be safe to use, but the locale_t type is frustratingly undocumented even for this public data [2].

If neither approach is supported, locale.get_current_locale_encoding() could instead parse the current locale encoding from setlocale(LC_CTYPE, NULL). The resulting locale string usually includes the codeset (e.g. "Greek_Greece.1253"). The exceptions are the "C" locale and BCP-47 (RFC 5646) locales that do not explicitly use UTF-8 (e.g. "el_GR" or "el" instead of "el_GR.UTF-8"), but these cases can be handled reliably.
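
A minimal sketch of that fallback parser follows. It is an illustration, not the proposed implementation: the function name `parse_locale_codeset` is hypothetical, and it returns None for the "C" locale and for names without an explicit codeset, leaving the choice of fallback (for instance the process ANSI code page) to the caller.

```python
def parse_locale_codeset(locale_name):
    """Extract a Python codec name from a CRT locale string, or None.

    `locale_name` is the kind of string setlocale(LC_CTYPE, NULL) returns,
    such as "Greek_Greece.1253" or "el_GR.UTF-8".
    """
    # No "." means no explicit codeset: "C", "el_GR", "el", ...
    _, dot, codeset = locale_name.rpartition(".")
    if not dot or not codeset:
        return None
    if codeset.isdigit():
        # Numeric codesets are Windows code pages; 65001 is CP_UTF8.
        return "utf-8" if codeset == "65001" else "cp" + codeset
    if codeset.lower().replace("-", "") == "utf8":
        # Covers "UTF-8" and variants such as "utf8".
        return "utf-8"
    # Unrecognized form; let the caller decide.
    return None


for name in ("Greek_Greece.1253", "el_GR.UTF-8", "el_GR.utf8", "el_GR", "C"):
    print(name, "->", parse_locale_codeset(name))
```

From Python, the input would come from locale.setlocale(locale.LC_CTYPE, None), which queries the current setting without changing it.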

---

[1] https://docs.microsoft.com/en-us/cpp/c-runtime-library/lc-codepage-func
[2] https://docs.microsoft.com/en-us/cpp/c-runtime-library/locale

History
Date	User	Action	Args
2021-03-20 23:05:25	eryksun	set	recipients: + eryksun, lemburg, vstinner, methane
2021-03-20 23:05:25	eryksun	set	messageid: <1616281525.24.0.605949642762.issue43552@roundup.psfhosted.org>
2021-03-20 23:05:25	eryksun	link	issue43552 messages
2021-03-20 23:05:24	eryksun	create