Add locale.get_locale_encoding() and locale.get_current_locale_encoding() #87718
Comments
I propose to add two new functions:
Technically, locale.get_locale_encoding() would simply expose _locale.get_locale_encoding() that I added recently. It calls the new private _Py_GetLocaleEncoding() function (which has no argument). By the way, Python requires nl_langinfo(CODESET) to be built. It's not a new requirement of Python 3.10, but I wanted to note that, I noticed it when I implemented _locale.get_locale_encoding() :-) Python has a bad habit of lying to the user: locale.getpreferredencoding(False) is *NOT* the current locale encoding in multiple cases.
Even if locale.getpreferredencoding(False) already exists, I propose to add locale.get_locale_encoding() because I dislike the locale.getpreferredencoding() API. By default, this function temporarily sets LC_CTYPE to the user preferred locale. It can cause mojibake in other threads since setlocale(LC_CTYPE, "") affects all threads :-( Calling locale.getpreferredencoding(), rather than locale.getpreferredencoding(False), is not what most people expect. This API can be misused. On the other hand, locale.get_locale_encoding() does exactly what it says: it only *gets* the encoding and doesn't temporarily *set* the locale to something else. By the way, the locale.localeconv() function can temporarily change the LC_CTYPE locale to the LC_MONETARY locale, which can cause other threads to use the wrong LC_CTYPE locale! But this is a different issue. |
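To make the "get" versus "set" distinction concrete, here is a minimal sketch using only the existing locale API. The helper name ctype_locale_unchanged_by_query() is hypothetical and is not part of the proposal; it only checks that the query-only form, getpreferredencoding(False), leaves the LC_CTYPE locale untouched.

```python
import locale


def ctype_locale_unchanged_by_query():
    """Return (encoding, unchanged) for a query-only encoding lookup.

    Hypothetical helper for illustration only.
    """
    # setlocale() with no second argument only queries the current setting.
    before = locale.setlocale(locale.LC_CTYPE)
    # do_setlocale=False: do not switch LC_CTYPE to the user preferred locale.
    encoding = locale.getpreferredencoding(False)
    after = locale.setlocale(locale.LC_CTYPE)
    return encoding, before == after


if __name__ == "__main__":
    encoding, unchanged = ctype_locale_unchanged_by_query()
    print("encoding:", encoding)
    print("LC_CTYPE unchanged:", unchanged)
```

The thread-safety concern described above applies to the default call, getpreferredencoding(), which may call setlocale(LC_CTYPE, "") internally. A single-threaded script like this one cannot show that race.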
On 19.03.2021 10:17, STINNER Victor wrote:
I'm not sure whether this would improve the situation much. The problem is that the locale module is meant to expose the lib C I had added locale.getdefaultlocale() to give applications a chance Part of the return value is the encoding, which would be set. Martin later added locale.getpreferredencoding(), which tries to However, the problem with nl_langinfo(CODESET) is the same as Unfortunately, lib C does not provide a way to query the locale Now, the correct way in all this would be to just call I'd suggest to add a single new API: locale.getencoding() which interfaces to nl_langinfo(CODESET) or the Windows code locale.getpreferredencoding() should then be deprecated. It does not make sense to pretend to query information which is And the documentation should point out that applications should PS: The locale module normally does not use underscores in |
I created this issue while reviewing the implementation of the PEP-597: PR 19481. Copy of my comments on the PR related to this issue. _locale.get_locale_encoding() calls _Py_GetLocaleEncoding() which returns UTF-8 if the Python UTF-8 Mode is enabled. Maybe the function could have a flag: please don't lie to me and return the current locale encoding ;-) Or we could add a function to get the *current* locale encoding: **locale.get_current_locale_encoding()**. This one would ignore the UTF-8 Mode and call nl_langinfo(CODESET). There are APIs to use the *current* locale encoding: PyUnicode_EncodeLocale/PyUnicode_DecodeLocale and _Py_EncodeLocaleEx/_Py_DecodeLocaleEx with current_locale=1. You can see which functions use it:
Currently, open(encoding=None) doesn't work like that. For example, on macOS, Android and VxWorks, it always uses UTF-8. And if the UTF-8 Mode is used, UTF-8 is used. In the PEP-597, I read that encoding="locale" is the same as encoding=None but doesn't emit an EncodingWarning. Where does the PEP-597 change the chosen encoding for the encoding=None case? The PEP says "locale encoding" without specifying exactly what it is. In Python, it means different things depending on the context. There is a subtle difference between the **current** locale encoding and "the locale encoding". I agree that it needs some clarification :-) While we discuss encodings, I never understood why open() gets the current locale encoding from nl_langinfo(CODESET), an encoding which can change at runtime while Python is running. For example, if thread A calls open(filename, encoding=None), thread B calls locale.localeconv(), and the LC_MONETARY locale uses a different encoding than the LC_CTYPE locale, thread A can get the LC_MONETARY encoding because of how locale.localeconv() is currently implemented: it temporarily changes LC_CTYPE to LC_MONETARY to decode the monetary fields of the localeconv() result. I would prefer that Python uses the same encoding for the whole lifetime of the process, from the beginning until the end. The Python filesystem encoding is a good choice for that. It's the same as locale.getpreferredencoding(False) (currently used by open() and friends), but becomes different if the LC_CTYPE is changed (temporarily or permanently). |
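One way to see which encoding open() actually picks when no encoding is passed is to inspect the file object's encoding attribute. The sketch below does that with a temporary file. The helper name default_text_encoding() is hypothetical, and the result depends on the platform, the locale and whether the UTF-8 Mode is enabled, so no particular value should be assumed.

```python
import locale
import os
import tempfile


def default_text_encoding():
    """Return the encoding open() selects when encoding is left as None.

    Hypothetical helper for illustration only.
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # No encoding argument: Python chooses its default text encoding.
        with open(path, "w") as f:
            return f.encoding
    finally:
        os.remove(path)


if __name__ == "__main__":
    print("open() default:", default_text_encoding())
    print("getpreferredencoding(False):", locale.getpreferredencoding(False))
```

Comparing the two printed values on a given system is a quick check of the relationship described above, though the names may differ in spelling or case (for example "UTF-8" versus "utf-8").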
I created PR 24931 to add locale.get_current_locale_encoding(). I tried to clarify the differences between the "current locale encoding" and the "locale encoding". Maybe we should rename the "locale encoding" to the "Python locale encoding", since it's not what most Unix developers would expect. What do you think? While most locale functions have no underscore in their name, it seems like the current trend is to allow underscores in names for *new* functions. For example, the sys module has functions without underscores:
But it got new functions with underscores:
... and there are some old functions with underscores:
In the locale module, there is one existing function with an underscore:
|
Python now does that during its initialization on all platforms. So getpreferredencoding(False) is what its documentation says: the user preferred encoding, the LC_CTYPE locale encoding. On Python 3.7, _Py_SetLocaleFromEnv(LC_CTYPE) was called in _Py_InitializeCore() on Unix, but not on Windows. Since Python 3.8, _PyPreConfig_Write() calls _Py_SetLocaleFromEnv(LC_CTYPE) on all platforms including Windows. See bpo-34485 and my article for more details ("C locale on Windows" section): _Py_SetLocaleFromEnv(LC_CTYPE) calls setlocale(LC_CTYPE, ""), but has more complex code on Android. |
This is locale.get_current_locale_encoding(). I would like to put "current" in the name, because there is a lot of confusion between the get_current_locale_encoding() encoding and the locale.getpreferredencoding(False) encoding. In locale.getpreferredencoding(False), Python ignores the locale in some cases, which is counterintuitive. I propose to add new functions to reduce confusion and better document the subtle differences between the different "locale encodings". That's also why I propose to rename the "locale encoding" to the "Python locale encoding" in the documentation: to clarify that Python ignores the locale sometimes. The PEP-538 (coerce the C locale) and PEP-540 (Python UTF-8 Mode) introduced confusion. |
On 19.03.2021 11:36, STINNER Victor wrote:
These attempts have resulted much of the confusion around the locale
locale.getencoding() works in the same way as locale.getlocale(). And, again, locale.getpreferredencoding() should be deprecated. |
Attached encodings.py lists the different "locale encodings" used by Python. Example: $ LANG=fr_FR ./python -X utf8 encodings.py fr_FR@euro
Set LC_CTYPE to 'fr_FR@euro' LC_ALL env var: '' (1) Python FS encoding (2) Python locale encoding (3) Current locale encoding (4) And more encodings for more fun! Python starts with LC_CTYPE locale set to fr_FR (ISO8859-1), then the script sets the LC_CTYPE locale to fr_FR@euro (ISO-8859-15). The Python UTF-8 Mode is enabled explicitly. We get a funny combination of not less than 3 encodings!
Which one is the correct one? Well... It depends :-) (1) The Python filesystem encoding is used to call almost all operating system functions: encode to the OS and decode from the OS. Filenames, environment variables, command line options, etc. (2) The "Python" locale encoding is used by open() when no encoding is specified. (3) The current locale encoding is used for a limited number of functions that I listed in msg389063. Most users should not use it. (4) locale.getpreferredencoding(True) is a weird beast. It is the Python locale encoding until setlocale(LC_CTYPE, locale) is called for the first time. But it can be the same if the Python UTF-8 Mode is enabled. I'm not sure in which category we should put this function :-( (4 bis) locale.getdefaultlocale()[1] is the only function returning the ISO-8859-1 encoding. This encoding is not used by any function. I'm not sure of the purpose of this function. It sounds confusing. I suggest to deprecate locale.getpreferredencoding(True). I'm not sure what to do with locale.getdefaultlocale(). Should we deprecate it? I never used this function. How is it used? For which purpose? I understand that in 2000, locale.getdefaultlocale() was interesting to avoid calling setlocale(LC_CTYPE, ""). But Python 3 calls setlocale(LC_CTYPE, "") by default at startup since the early versions, and it's now called on all platforms since Python 3.8. Moreover, its internal database seems to be outdated and is painful to maintain (especially if we consider all platforms supported by Python, not only Linux, there are many issues on macOS). |
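For readers who want to inspect categories (1) to (3) on their own system, here is a small sketch. It is not the attached encodings.py, which is not reproduced in this thread. The helper name collect_encodings() and the dictionary keys are hypothetical. It skips getpreferredencoding(True) and getdefaultlocale() because the comment above suggests deprecating them.

```python
import locale
import sys


def collect_encodings():
    """Gather the encodings discussed above into a dict.

    Hypothetical helper; not the attached encodings.py script.
    """
    info = {
        # Whether the Python UTF-8 Mode is enabled (-X utf8 or PYTHONUTF8=1).
        "utf8_mode": bool(sys.flags.utf8_mode),
        # (1) Python filesystem encoding.
        "filesystem": sys.getfilesystemencoding(),
        # (2) What the thread calls the "Python locale encoding".
        "getpreferredencoding(False)": locale.getpreferredencoding(False),
    }
    # (3) Current locale encoding; nl_langinfo() is not available on Windows.
    if hasattr(locale, "nl_langinfo"):
        info["nl_langinfo(CODESET)"] = locale.nl_langinfo(locale.CODESET)
    return info


if __name__ == "__main__":
    for name, value in collect_encodings().items():
        print(f"{name}: {value}")
```

Running it with and without -X utf8, or after calling locale.setlocale(locale.LC_CTYPE, ...), should show how the values diverge. The exact output depends on the system.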
When I designed and implemented the PEP-540 (Python UTF-8 Mode), I tried to leave getpreferredencoding() unchanged. The problem was that I quickly got mojibake because too many functions call getpreferredencoding(False):
The Python UTF-8 Mode ignores the locale *on purpose*. But I agree that it's surprising and can lead to confusion. That's what I'm trying to fix here :-) |
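The behavior described in this thread can be approximated in a few lines: the "Python locale encoding" is UTF-8 when the UTF-8 Mode is on, and otherwise it follows the LC_CTYPE locale. The sketch below is an approximation under that assumption, not CPython's implementation. The function name approx_python_locale_encoding() is hypothetical, and the platform-specific cases mentioned earlier in the thread are deliberately left out.

```python
import locale
import sys


def approx_python_locale_encoding(utf8_mode=None):
    """Approximate the "Python locale encoding" discussed in this thread.

    Hypothetical helper: a simplification, not CPython's actual logic.
    """
    if utf8_mode is None:
        utf8_mode = bool(sys.flags.utf8_mode)
    if utf8_mode:
        # The UTF-8 Mode ignores the locale on purpose.
        return "utf-8"
    if hasattr(locale, "nl_langinfo"):
        # POSIX: encoding of the current LC_CTYPE locale.
        return locale.nl_langinfo(locale.CODESET)
    # Fallback where nl_langinfo() is unavailable (e.g. Windows).
    return locale.getpreferredencoding(False)


if __name__ == "__main__":
    print("UTF-8 Mode forced on:", approx_python_locale_encoding(True))
    print("UTF-8 Mode forced off:", approx_python_locale_encoding(False))
```

Passing utf8_mode explicitly makes it easy to compare both answers within one process, which is the distinction the proposed pair of functions was meant to expose.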
Recently, I spent some days to document properly encodings used by Python. Python filesystem encoding: Python filesystem errors: stdio encoding and errors: Glossary: "Locale encoding" Glossary: "filesystem encoding and error handler" Python UTF-8 Mode: |
I don't see why the Windows implementation is inconsistent with POSIX here. If it were changed to be consistent, the default encoding at startup would remain the same, since setlocale(LC_CTYPE, "") uses the process code page from GetACP(). In many if not most cases, no one would be the wiser. But it seems to me that if a script calls setlocale(LC_CTYPE, "el_GR"), then it clearly wants to encode Greek text (code page 1253). open() with encoding passed as None or "locale" should respect this. Similarly if it calls setlocale(LC_CTYPE, ".UTF-8"), then it wants the default locale (language/region), but with UTF-8 encoding. The following is a snippet to get the current locale encoding with ucrt in Windows:

    #include <locale.h>
    #include <windows.h>  /* GetACP() */

    /* Fragment: these statements belong inside a function body. */
    int cp = 0;
    __crt_locale_data_public *locale_data;
    _locale_t locale = _get_current_locale();
    if (locale) {
        /* Read the LC_CTYPE code page from the public part of the locale data. */
        locale_data = (__crt_locale_data_public *)locale->locinfo;
        cp = locale_data->_locale_lc_codepage;
        _free_locale(locale);
    }
    if (cp == 0) {
        /* "C" locale. The CRT in effect uses Latin-1 (cp28591), but
           Windows Python prefers the process code page. */
        cp = GetACP();
    }

With ucrt, the C runtime was changed to hide most of the locale definition that was previously public, but it intentionally defines __crt_locale_data_public, so I'm assuming it's there for programs to use. That said, the fact that we have to cast locinfo seems suspect to me. Steve Dower could maybe check with the ucrt devs to ensure that this is supported. There's also ___lc_codepage() to get the same value more simply, and also more efficiently since the current locale data doesn't have to be copied and freed. However, it's documented as internal and could be removed (unlikely as that is). |
On 19.03.2021 12:05, STINNER Victor wrote:
Yes, deprecate it as well. If Python calls setlocale() per default now, The alias database is needed by the normalization engine. We may be |
On 19.03.2021 12:26, STINNER Victor wrote:
Thanks for documenting this. I would prefer to leave the locale module to really just an interface Hopefully, in a few years, we can get rid of all this and standardize |
On 19.03.2021 12:35, Eryk Sun wrote:
I'm not sure I understand what you're saying (but then, I have little My assumption is that nl_langinfo(CODESET) does not work on Windows If it does work, getencoding() could just be a shim over |
Except not for embedding applications if configure_locale [1] isn't set. But in that case determining the default locale isn't Python's problem to solve.
There is no such function for CRT locales. I provided two alternatives that would allow implementing this consistent with POSIX, and thus avoid all of the "except on Windows..." disclaimers that have to explain (apologize) that only the process ANSI code page is used in Windows, and, for no good reason as far as I can tell, the LC_CTYPE locale encoding is completely ignored. --- [1] https://docs.python.org/3/c-api/init_config.html#c.PyPreConfig.configure_locale |
On 19.03.2021 13:25, Eryk Sun wrote:
Sounds good. If we can get consistent behavior on Windows as well, |
I created bpo-43557 "Deprecate getdefaultlocale(), getlocale() and normalize() functions". Let's discuss deprecating getdefaultlocale() there. |
The problem is that there are two different "locale encodings", what I call:
It is unfortunate that the Python UTF-8 Mode, which "ignores the locale", changes the behavior of the locale module, specifically of the locale.getpreferredencoding() function. But the ship has sailed. People are used to looking in the "locale" module to get the "locale" encoding. So I prefer to put the function to get the "Python locale encoding" in the locale module. I propose to add "current" in the name since this encoding is not the one you are looking for usually. An alternative is to have a single function with an optional parameter. Example:
|
What I want is same to Background: PEP-597 adds new But this is wrong in UTF-8 mode. In UTF-8 mode, it's fine to I don't want to add new meaning here. It should be same to I don't care its name. both of sys.locale_encoding() and locale.get_encoding() are OK. |
Is it about the current implementation of the PEP-597, or are you thinking of a future Python which would use UTF-8 by default? Currently, getpreferredencoding(False) respects the behavior that you described, no? |
I had forgotten to consider UTF-8 mode while finishing PEP-597. If possible, I want to ignore UTF-8 mode when
getpreferredencoding(False) respects UTF-8 mode. That's what PEP-597 said (because the PEP doesn't define behavior in UTF-8 mode) and #63680 implements. But it is not what I want for now. I want to ignore UTF-8 mode when This is almost "only in Windows" issue, and users can use But |
On 19.03.2021 14:47, STINNER Victor wrote:
The UTF-8 mode is a Python invention. It doesn't have anything to Please don't mix the two. In fact, in order to avoid issues, Python should probably set the locale
-1, both on the names and the idea to again add parameters which change |
On 19.03.2021 14:57, Inada Naoki wrote:
Please address UTF-8 mode explicitly in open() or elsewhere. The locale As mentioned, both should ideally be synchronized, though, so |
I agree with you. APIs in the locale module shouldn't be aware of UTF-8 mode.
There is PEP-538 already :) |
On 19.03.2021 16:15, Inada Naoki wrote:
I already wrote earlier that we should deprecate this API, since the We need to get things separated out clearly again: the locale
Great :-) |
Why is it being specified that the current LC_CTYPE encoding should be ignored in Windows when a "locale" encoding is requested? Cross-platform C code would use mbstowcs() and wcstombs(), with the current LC_CTYPE encoding. That's Latin-1 in the initial "C" locale and defaults to GetACP() if setlocale(LC_CTYPE, "") is called, but otherwise it's whatever locale is requested by the program and supported by the system (all Windows installations support pretty much every locale). |
Because
So it is not an option to assign another encoding. See PEP-597 for details. I know you are proposing to use CRT locale on Windows. If we change the |
What's returned by locale.get_locale_encoding() and locale.get_current_locale_encoding() is relevant to adding them as new functions and is a chance to implement this correctly in Windows. You're right that what open() does for encoding="locale" is a separate issue, with backwards compatibility problems. I think it was implemented badly and needlessly inconsistent with POSIX. But we may be stuck with the behavior considering scripts are within their rights, per documented behavior, to expect that calling setlocale(LC_CTYPE, locale_name) in Windows has no effect on the result of locale.getpreferredencoding(False), unlike POSIX generally, except for some platforms such as macOS and Android. |
Python has used GetACP(), the ANSI code page of the operating system, for years. What is the advantage of using a different encoding? In my experience, most applications use the ANSI code page because they use the ANSI flavor of the Windows API. What is the use case for using ___lc_codepage()? Is it a different encoding? |
The default encoding at startup and in the "C" locale wouldn't change. It would only differ from the default if setlocale(LC_CTYPE, locale_name) sets it otherwise. The suggestion is to match the behavior of nl_langinfo(CODESET) in Linux and many other POSIX systems. When I say the default encoding won't change, I mean that the Universal C Runtime (ucrt) system component uses the process ANSI code page as the default locale encoding for setlocale(LC_CTYPE, ""). This agrees with what Python has always done, but it disagrees with previous versions of the CRT in Windows. Personally, I think it's a misstep because the user locale isn't necessarily compatible with the process code page, but I'm not looking to change this decision. For example, if the user locale is "el_GR" (Greek, Greece) but the process code page is 1252 (Latin) instead of 1253 (Greek), I get the following result in Python 3.4 (VC++ 10) vs Python 3.5 (ucrt):
The result from VC++ 10 is consistent with the user locale. It's also consistent with multilingual user interface (MUI) text, such as error messages, or at least it should be, because the user locale and user preferred language (i.e. Windows display language) should be consistent. (The control panel dialog to set the user locale in Windows 10 has an option to match the display language, which is the recommended and default setting.) For example, Python uses system error messages that are localized to the user's preferred language:

    >py -c "import os; os.stat('spam')"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    FileNotFoundError: [WinError 2] Δεν είναι δυνατή η εύρεση του καθορισμένου αρχείου από το σύστημα: 'spam'

This example is on a system where the process (system) ANSI code page is 1252 (Latin), which cannot encode the user's preferred Greek text. Thankfully Python 3.6+ uses the console's Unicode API, so neither the console session's output code page nor the process code page gets in the way. On the other hand, if this Greek text is written to a file or piped to a child process using subprocess.Popen(), Python's choice of locale encoding based on the process code page (Latin) is incompatible with Greek text, and thus it's incompatible with the current user's preferred locale and language settings. The process ANSI code page from GetACP() has its uses, which are important. It's a system setting that's independent of the current user locale and thus useful when interacting with the legacy system API and as a common encoding for inter-process data exchange when applications do not use Unicode and may be operating in different locales. So if you're writing to a legacy-encoded text file that's shared by multiple users or piping text to an arbitrary program, then using the ANSI code page is probably okay. Though, especially for IPC, there's a good chance that it's wrong since Windows has never set, let alone enforced, a standard in that case. Using the process ANSI code page in the "C" locale makes sense to me.
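The incompatibility between code page 1252 and Greek text can be checked directly with Python's codecs, independent of the operating system. The sketch below reuses the error message quoted above. The helper name can_encode() is hypothetical.

```python
GREEK_MESSAGE = (
    "Δεν είναι δυνατή η εύρεση του καθορισμένου αρχείου από το σύστημα"
)


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec.

    Hypothetical helper for illustration only.
    """
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    for encoding in ("cp1252", "cp1253", "utf-8"):
        print(encoding, can_encode(GREEK_MESSAGE, encoding))
```

Writing this message to a file opened with the cp1252 default would raise UnicodeEncodeError for the same reason, unless an error handler such as "replace" is used.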
I always forget the "_func" suffix in the name; it's ___lc_codepage_func() [1]. The lc_codepage value is the current LC_CTYPE codeset as an integer code page. It's the equivalent of nl_langinfo(CODESET) in POSIX. For UTF-8, the code page is CP_UTF8 (65001), but this gets displayed in locale strings as "UTF-8" (or variants such as "utf8"). It could be the LC_CTYPE encoding of just the current thread, but Python does not enable per-thread locales. The CRT has exported ___lc_codepage_func() since VC++ 7.0 (2002), and before that the current lc_codepage value itself was directly exported as __lc_codepage. However, this triple-dundered function is documented as internal and not recommended for use. That's why the code snippet I showed uses _get_current_locale() with locinfo cast to __crt_locale_data_public *. This takes "public" in the struct name at face value. Anything that's declared public should be safe to use, but the locale_t type is frustratingly undocumented even for this public data [2]. If neither approach is supported, locale.get_current_locale_encoding() could instead parse the current locale encoding from setlocale(LC_CTYPE, NULL). The resulting locale string usually includes the codeset (e.g. "Greek_Greece.1253"). The exceptions are the "C" locale and BCP-47 (RFC 5646) locales that do not explicitly use UTF-8 (e.g. "el_GR" or "el" instead of "el_GR.UTF-8"), but these cases can be handled reliably. --- [1] https://docs.microsoft.com/en-us/cpp/c-runtime-library/lc-codepage-func |
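The string-parsing fallback mentioned above could look roughly like the following sketch. It is an assumption-laden illustration, not a proposed implementation. The function name codeset_from_locale_string() is hypothetical, and it returns None for the "C" locale and for names without an explicit codeset, leaving the choice of default (for example the process ANSI code page) to the caller.

```python
import codecs


def codeset_from_locale_string(locale_string):
    """Extract a Python codec name from a setlocale(LC_CTYPE, NULL) string.

    Hypothetical helper. Returns None when the string has no explicit codeset.
    """
    if locale_string in ("C", "POSIX"):
        return None
    _, dot, codeset = locale_string.partition(".")
    if not dot:
        # e.g. "el_GR" or "el": no explicit codeset in the name.
        return None
    # Drop a POSIX-style modifier such as "@euro" if present.
    codeset = codeset.partition("@")[0]
    if not codeset:
        return None
    if codeset.isdigit():
        # Windows code page number, e.g. "1253" -> "cp1253".
        codeset = "cp" + codeset
    try:
        # Normalize spelling variants such as "UTF-8" and "utf8".
        return codecs.lookup(codeset).name
    except LookupError:
        return codeset


if __name__ == "__main__":
    for name in ("Greek_Greece.1253", "el_GR.UTF-8", ".utf8", "el_GR", "C"):
        print(repr(name), "->", codeset_from_locale_string(name))
```

In a real program the input would come from locale.setlocale(locale.LC_CTYPE), which queries the current setting when called without a second argument.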
PEP-597 was implemented successfully in Python 3.10 without this feature being required. There is no agreement yet on what the "current locale encoding" is. For now, I prefer to close the issue. We can reconsider this feature once there are more user requests for such a function and once there is an agreement on what the "current locale encoding" is. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.