Issue43552
Created on 2021-03-19 09:17 by vstinner, last changed 2022-04-11 14:59 by admin. This issue is now closed.
Files
File name | Uploaded | Description
---|---|---
encodings.py | vstinner, 2021-03-19 11:05 |
Pull Requests
URL | Status | Linked
---|---|---
PR 24931 | closed | vstinner, 2021-03-19 10:15
Messages (32) | |||
---|---|---|---|
msg389057 - (view) | Author: STINNER Victor (vstinner) * | Date: 2021-03-19 09:17 | |
I propose to add two new functions:

* locale.get_locale_encoding(): it's exactly the same as locale.getpreferredencoding(False).
* locale.get_current_locale_encoding(): always get the current locale encoding. Read the ANSI code page on Windows, or nl_langinfo(CODESET) on other platforms. Ignore the UTF-8 Mode. Don't always return "UTF-8" on macOS, Android, VxWorks.

Technically, locale.get_locale_encoding() would simply expose _locale.get_locale_encoding() that I added recently. It calls the new private _Py_GetLocaleEncoding() function (which has no argument).

By the way, Python requires nl_langinfo(CODESET) to be built. It's not a new requirement of Python 3.10, but I wanted to note it, since I noticed it when I implemented _locale.get_locale_encoding() :-)

Python has a bad habit of lying to the user: locale.getpreferredencoding(False) is *NOT* the current locale encoding in multiple cases.

* locale.getpreferredencoding(False) always returns "UTF-8" on macOS, Android and VxWorks
* locale.getpreferredencoding(False) always returns "UTF-8" if the UTF-8 Mode is enabled
* otherwise, it returns the current locale encoding: ANSI code page on Windows, or nl_langinfo(CODESET) on other platforms

Even if locale.getpreferredencoding(False) already exists, I propose to add locale.get_locale_encoding() because I dislike the locale.getpreferredencoding() API. By default, this function temporarily sets LC_CTYPE to the user preferred locale. It can cause mojibake in other threads since setlocale(LC_CTYPE, "") affects all threads :-( Calling locale.getpreferredencoding(), rather than locale.getpreferredencoding(False), is not what most people expect. This API can be misused.

On the other hand, locale.get_locale_encoding() does exactly what it says: only *get* the encoding, don't temporarily *set* a locale to something else.

By the way, the locale.localeconv() function can temporarily change the LC_CTYPE locale to the LC_MONETARY locale, which can cause other threads to use the wrong LC_CTYPE locale! But this is a different issue. |
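The difference between the two values described above can be inspected on a running interpreter. The following is only a minimal sketch, not the proposed implementation: `current_locale_encoding()` is a hypothetical helper standing in for the proposed locale.get_current_locale_encoding(). It assumes locale.nl_langinfo(locale.CODESET) is available on non-Windows platforms and, as a further assumption, reads the ANSI code page through ctypes on Windows.

```python
import locale
import sys


def current_locale_encoding():
    """Hypothetical stand-in for the proposed get_current_locale_encoding()."""
    if sys.platform == "win32":
        # Assumption: read the process ANSI code page directly via ctypes.
        import ctypes
        return "cp%d" % ctypes.windll.kernel32.GetACP()
    # Other platforms: ask the C library for the codeset of the
    # current LC_CTYPE locale. This does not call setlocale().
    return locale.nl_langinfo(locale.CODESET)


# What Python reports as the "locale encoding" (may be forced to UTF-8).
python_encoding = locale.getpreferredencoding(False)
# What the C library currently uses.
current_encoding = current_locale_encoding()

print("getpreferredencoding(False):", python_encoding)
print("current locale encoding:    ", current_encoding)
print("UTF-8 Mode enabled:         ", bool(sys.flags.utf8_mode))
```

Running the script once normally and once with `-X utf8` under a non-UTF-8 locale is one way to see the two values diverge; under a UTF-8 locale they will usually name the same encoding.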
|||
msg389062 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2021-03-19 09:54 | |
On 19.03.2021 10:17, STINNER Victor wrote:
>
> New submission from STINNER Victor <vstinner@python.org>:
>
> I propose to add two new functions:
>
> * locale.get_locale_encoding(): it's exactly the same as locale.getpreferredencoding(False).
>
> * locale.get_current_locale_encoding(): always get the current locale encoding. Read the ANSI code page on Windows, or nl_langinfo(CODESET) on other platforms. Ignore the UTF-8 Mode. Don't always return "UTF-8" on macOS, Android, VxWorks.

I'm not sure whether this would improve the situation much.

The problem is that the locale module is meant to expose the lib C locale settings, but many of the recent additions actually do something completely different: they look into the process and user environment and try to determine external settings, which are not reflected in the lib C locale settings.

I had added locale.getdefaultlocale() to give applications a chance to determine the locale setting defined by the process environment *without* calling setlocale(LC_ALL, '') and causing problems in other threads. I used the X11 database for locale encodings, which was the closest you could get to a standard for encodings at the time (around 2000). Part of the return value is the encoding, which would be set.

Martin later added locale.getpreferredencoding(), which tries to determine the encoding in a different, new way, based on nl_langinfo(CODESET). As you mentioned, this intention was broken on several platforms by forcing UTF-8 as output. And in many cases, the API had to call setlocale() as well, causing the thread problems.

However, the problem with nl_langinfo(CODESET) is the same as with setlocale(): it returns the current state of the lib C settings, which may well point to the 'C' locale, not the settings the user has configured in the OS environment. So while you get an encoding defined by lib C for the current locale settings (without guessing it as with locale.getdefaultlocale()), you still don't get what the user really wants to use.

Unfortunately, lib C does not provide a way to query the locale database without changing the locale settings at the same time. This is the main issue we're facing.

Now, the correct way in all this would be to just call setlocale(LC_ALL, '') at the start of the application and not try to apply all the magic to get around this. But this has to be done by the application and not Python (which may well be embedded into some other application).

I'd suggest adding a single new API:

locale.getencoding()

which interfaces to nl_langinfo(CODESET) or the Windows code page and does not try to do any magic, i.e. does *not* call setlocale(). It needs to return what the lib C currently knows and uses as encoding.

locale.getpreferredencoding() should then be deprecated. It does not make sense to pretend to query information which is not really directly available from the lib C locale system.

And the documentation should point out that applications should call setlocale(LC_ALL, '') when they start up, if they want to get the lib C locale, and thus the Python locale module, set up to work based on what the user really wants -- instead of just guessing at this.

PS: The locale module normally does not use underscores in function names, so it's not a good idea to add more. |
|||
msg389063 - (view) | Author: STINNER Victor (vstinner) * | Date: 2021-03-19 10:17 | |
I created this issue while reviewing the implementation of the PEP 597: PR 19481. Copy of my comments on the PR related to this issue.

_locale.get_locale_encoding() calls _Py_GetLocaleEncoding() which returns UTF-8 if the Python UTF-8 Mode is enabled. Maybe the function could have a flag: please don't lie to me and return the current locale encoding ;-) Or we could add a function to get the *current* locale encoding: **locale.get_current_locale_encoding()**. This one would ignore the UTF-8 Mode and call nl_langinfo(CODESET).

There are APIs to use the *current* locale encoding: PyUnicode_EncodeLocale/PyUnicode_DecodeLocale and _Py_EncodeLocaleEx/_Py_DecodeLocaleEx with current_locale=1. You can see which functions use it:

* decode tm_zone field of localtime_r() and gmtime()
* decode tzname[0] and tzname[1] strings
* decode setlocale() result
* decode some localeconv() fields (this function requires switching to a different locale encoding, it's bad!)
* decode nl_langinfo() result
* decode gettext(), dgettext(), dcgettext(), textdomain(), bindtextdomain(), bind_textdomain_codeset() result
* decode strerror() and dlerror() result
* encode/decode in the readline module
* encode format string for strftime() in time.strftime() (only used on Windows, Unix provides wcsftime) and then decode strftime() result

> encoding="locale" : Uses locale encoding regardless UTF-8 mode.

Currently, open(encoding=None) doesn't work like that. For example, on macOS, Android and VxWorks, it always uses UTF-8. And if the UTF-8 Mode is used, UTF-8 is used.

In the PEP 597, I read that encoding="locale" is the same as encoding=None but doesn't emit an EncodingWarning. Where does the PEP 597 change the chosen encoding for the encoding=None case?

The PEP says "locale encoding" without specifying exactly what it is. In Python, it means different things depending on the context. There is a subtle difference between the **current** locale encoding and "the locale encoding". I agree that it needs some clarification :-)

While we discuss encodings, I never understood why open() gets the current locale encoding from nl_langinfo(CODESET), an encoding which can change at runtime while Python is running. For example, if thread A calls open(filename, encoding=None), thread B calls locale.localeconv(), and the LC_MONETARY locale uses a different encoding than the LC_CTYPE locale, thread A can get the LC_MONETARY encoding because of how locale.localeconv() is currently implemented: it temporarily changes LC_CTYPE to LC_MONETARY to decode the monetary fields of the localeconv() result.

I would prefer that Python uses the same encoding for the whole lifetime of the process, from the beginning until the end. The Python filesystem encoding is a good choice for that. It's the same as locale.getpreferredencoding(False) (currently used by open() and friends), but becomes different if the LC_CTYPE locale is changed (temporarily or permanently). |
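The point that the Python filesystem encoding stays fixed for the lifetime of the process, even when LC_CTYPE is changed, can be illustrated with a small sketch. It switches LC_CTYPE to the "C" locale (chosen here only because that locale is always available) and restores the previous setting afterwards. Note that setlocale() affects the whole process, so this should not be run in a program with other threads that depend on the locale.

```python
import locale
import sys

# Filesystem encoding as chosen at interpreter startup.
fs_before = sys.getfilesystemencoding()

# Calling setlocale() with a single argument only queries the setting.
saved = locale.setlocale(locale.LC_CTYPE)
try:
    # Change the LC_CTYPE locale at runtime.
    locale.setlocale(locale.LC_CTYPE, "C")
    fs_after = sys.getfilesystemencoding()
finally:
    # Restore the previous LC_CTYPE locale.
    locale.setlocale(locale.LC_CTYPE, saved)

print("filesystem encoding before:", fs_before)
print("filesystem encoding after: ", fs_after)
```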
|||
msg389064 - (view) | Author: STINNER Victor (vstinner) * | Date: 2021-03-19 10:22 | |
I created PR 24931 to add locale.get_current_locale_encoding(). I tried to clarify the differences between the "current locale encoding" and the "locale encoding". Maybe we should rename the "locale encoding" to the "Python locale encoding", since it's not what most Unix developers would expect. What do you think?

While most locale functions have no underscore in their name, it seems like the current trend is to allow underscores in names for *new* functions. For example, the sys module has functions without underscores:

* sys.getallocatedblocks()
* sys.getdefaultencoding()
* sys.getfilesystemencodeerrors
* ...

But it got new functions with underscores:

* sys.set_asyncgen_hooks()
* sys.set_coroutine_origin_tracking_depth()

... and there are some old functions with underscores:

* sys.exc_info()
* sys.call_tracing()
* sys._clear_type_cache()
* sys._current_frames()

In the locale module, there is one existing function with an underscore:

* locale.format_string() |
|||
msg389065 - (view) | Author: STINNER Victor (vstinner) * | Date: 2021-03-19 10:31 | |
> Now, the correct way in all this would be to just call setlocale(LC_ALL, '') at the start of the application Python now does that during its initialization on all platforms. So getpreferredencoding(False) is what its documentation says: the user preferred encoding, the LC_CTYPE locale encoding. On Python 3.7, _Py_SetLocaleFromEnv(LC_CTYPE) was called in _Py_InitializeCore() on Unix, but not on Windows. Since Python 3.8, _PyPreConfig_Write() calls _Py_SetLocaleFromEnv(LC_CTYPE) on all platforms including Windows. See bpo-34485 and my article for more details ("C locale on Windows" section): https://vstinner.github.io/python3-locales-encodings.html _Py_SetLocaleFromEnv(LC_CTYPE) calls setlocale(LC_CTYPE, ""), but has more complex code on Android. |
|||
msg389066 - (view) | Author: STINNER Victor (vstinner) * | Date: 2021-03-19 10:36 | |
> locale.getencoding() > > which interfaces to nl_langinfo(CODESET) or the Windows code > page and does not try to do any magic, ie. does *not* call > setlocale(). It needs to return what the lib C currently > knows and uses as encoding. This is locale.get_current_locale_encoding(). I would like to put "current" in the name, because there is a lot of confusion between get_current_locale_encoding() encoding and locale.getpreferredencoding(False) encoding. In locale.getpreferredencoding(False), Python ignores the locale in some cases which is counter intuitive. I propose to add new functions to reduce confusion and better document the subtle differences between the different "locale encodings". That's also why I propose to rename the "locale encoding" to the "Python locale encoding" in the documentation: clarify the Python ignores the locale sometimes. The PEP 538 (coerce the C locale) and PEP 540 (Python UTF-8 Mode) introduced confusion. |
|||
msg389068 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2021-03-19 10:59 | |
On 19.03.2021 11:36, STINNER Victor wrote: > > STINNER Victor <vstinner@python.org> added the comment: > >> locale.getencoding() >> >> which interfaces to nl_langinfo(CODESET) or the Windows code >> page and does not try to do any magic, ie. does *not* call >> setlocale(). It needs to return what the lib C currently >> knows and uses as encoding. > > This is locale.get_current_locale_encoding(). I would like to put "current" in the name, because there is a lot of confusion between get_current_locale_encoding() encoding and locale.getpreferredencoding(False) encoding. In locale.getpreferredencoding(False), Python ignores the locale in some cases which is counter intuitive. These attempts have resulted much of the confusion around the locale module. It's better not to create more of it. - "locale" in the name is unnecessary, since this is the locale module. - If you add "current", people will rightly ask: then what do all the other APIs in the locale module return ? Of course, they all return the current state of settings :-) So this is unnecessary as well. locale.getencoding() works in the same way as locale.getlocale(). It interfaces to the lib C and returns the current encoding setting as known by the lib C. It's just a more intuitive name than locale.nl_langinfo(CODESET) and works on Windows as well. And, again, locale.getpreferredencoding() should be deprecated. The API has been misused in too many ways and is completely broken by now. It was a good idea at the time, when Martin added it, even though I never liked the name. |
|||
msg389069 - (view) | Author: STINNER Victor (vstinner) * | Date: 2021-03-19 11:05 | |
Attached encodings.py lists the different "locale encodings" used by Python. Example:
---
$ LANG=fr_FR ./python -X utf8 encodings.py fr_FR@euro
Set LC_CTYPE to 'fr_FR@euro'

LC_ALL env var: ''
LC_CTYPE env var: ''
LANG env var: 'fr_FR'
LC_CTYPE locale: 'fr_FR@euro'
Coerce C locale: 0
Python UTF-8 Mode: 1

(1) Python FS encoding
sys.getfilesystemencoding(): 'utf-8'

(2) Python locale encoding
_locale._get_locale_encoding(): 'UTF-8'
locale.getpreferredencoding(False): 'UTF-8'

(3) Current locale encoding
locale.get_current_locale_encoding(): 'ISO-8859-15'

(4) And more encodings for more fun!
locale.getdefaultlocale()[1]: 'ISO8859-1'
locale.getpreferredencoding(True): 'UTF-8'
---

Python starts with the LC_CTYPE locale set to fr_FR (ISO8859-1), then the script sets the LC_CTYPE locale to fr_FR@euro (ISO-8859-15). The Python UTF-8 Mode is enabled explicitly. We get a funny combination of no fewer than 3 encodings!

* UTF-8
* ISO-8859-1
* ISO-8859-15

Which one is the correct one? Well... It depends :-)

(1) The Python filesystem encoding is used to call almost all operating system functions: encode to the OS and decode from the OS. Filenames, environment variables, command line options, etc.

(2) The "Python" locale encoding is used by open() when no encoding is specified.

(3) The current locale encoding is used for a limited number of functions that I listed in msg389063. Most users should not use it.

(4) locale.getpreferredencoding(True) is a weird beast. It is the Python locale encoding until setlocale(LC_CTYPE, locale) is called for the first time. But it can be the same if the Python UTF-8 Mode is enabled. I'm not sure in which category we should put this function :-(

(4 bis) locale.getdefaultlocale()[1] is the only function returning the ISO-8859-1 encoding. This encoding is not used by any function. I'm not sure of the purpose of this function. It sounds confusing.

I suggest to deprecate locale.getpreferredencoding(True).

I'm not sure what to do with locale.getdefaultlocale(). Should we deprecate it? I never used this function. How is it used? For which purpose?

I understand that in 2000, locale.getdefaultlocale() was interesting to avoid calling setlocale(LC_CTYPE, ""). But Python 3 calls setlocale(LC_CTYPE, "") by default at startup since the early versions, and it's now called on all platforms since Python 3.8. Moreover, its internal database seems to be outdated and is painful to maintain (especially if we consider all platforms supported by Python, not only Linux; there are many issues on macOS). |
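For readers without access to the attachment, a report in the spirit of the output above can be sketched as follows. This is not the attached encodings.py: `collect_encodings()` is a hypothetical helper, it only uses functions that exist in released Python versions, and it leaves out the proposed locale.get_current_locale_encoding() and the private _locale function, substituting locale.nl_langinfo(locale.CODESET) where that is available.

```python
import locale
import os
import sys


def collect_encodings():
    """Hypothetical helper: gather some encodings discussed in this issue."""
    report = {
        "LC_ALL env var": os.environ.get("LC_ALL", ""),
        "LC_CTYPE env var": os.environ.get("LC_CTYPE", ""),
        "LANG env var": os.environ.get("LANG", ""),
        # One-argument setlocale() only queries the current setting.
        "LC_CTYPE locale": locale.setlocale(locale.LC_CTYPE),
        "Python UTF-8 Mode": sys.flags.utf8_mode,
        "sys.getfilesystemencoding()": sys.getfilesystemencoding(),
        "locale.getpreferredencoding(False)": locale.getpreferredencoding(False),
    }
    # nl_langinfo() is not available on every platform (e.g. Windows).
    if hasattr(locale, "nl_langinfo"):
        report["locale.nl_langinfo(CODESET)"] = locale.nl_langinfo(locale.CODESET)
    return report


report = collect_encodings()
for name, value in report.items():
    print(f"{name}: {value!r}")
```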
|||
msg389070 - (view) | Author: STINNER Victor (vstinner) * | Date: 2021-03-19 11:11 | |
> Martin later added locale.getpreferredencoding(), which tries to
> determine the encoding in a different, new way, based on
> nl_langinfo(CODESET). As you mentioned, this intention was broken
> on several platforms by forcing UTF-8 as output.

When I designed and implemented the PEP 540 (Python UTF-8 Mode), I tried to leave getpreferredencoding() unchanged. The problem was that I quickly got mojibake because too many functions call getpreferredencoding(False):

* open() and _pyio.open() -- in Python 3.10, open() now calls the C _Py_GetLocaleEncoding() function to fix issues during Python shutdown; it also avoids issues at startup.
* Many gettext functions
* cgi to decode the query string from the QUERY_STRING env var or sys.argv[1]
* xml.etree.ElementTree.write(encoding="unicode") in some cases

The Python UTF-8 Mode ignores the locale *on purpose*. But I agree that it's surprising and can lead to confusion. That's what I'm trying to fix here :-) |
|||
msg389072 - (view) | Author: STINNER Victor (vstinner) * | Date: 2021-03-19 11:26 | |
Recently, I spent some days to document properly encodings used by Python. Python filesystem encoding: https://docs.python.org/dev/c-api/init_config.html#c.PyConfig.filesystem_encoding Python filesystem errors: https://docs.python.org/dev/c-api/init_config.html#c.PyConfig.filesystem_errors stdio encoding and errors: https://docs.python.org/dev/c-api/init_config.html#c.PyConfig.stdio_encoding Glossary: "Locale encoding" https://docs.python.org/dev/glossary.html#term-locale-encoding Glossary: "filesystem encoding and error handler" https://docs.python.org/dev/glossary.html#term-filesystem-encoding-and-error-handler Python UTF-8 Mode: https://docs.python.org/dev/library/os.html#utf8-mode |
|||
msg389074 - (view) | Author: Eryk Sun (eryksun) * | Date: 2021-03-19 11:35 | |
> Read the ANSI code page on Windows,

I don't see why the Windows implementation is inconsistent with POSIX here. If it were changed to be consistent, the default encoding at startup would remain the same, since setlocale(LC_CTYPE, "") uses the process code page from GetACP(). In many if not most cases, no one would be the wiser. But it seems to me that if a script calls setlocale(LC_CTYPE, "el_GR"), then it clearly wants to encode Greek text (code page 1253). open() with encoding passed as None or "locale" should respect this. Similarly if it calls setlocale(LC_CTYPE, ".UTF-8"), then it wants the default locale (language/region), but with UTF-8 encoding.

The following is a snippet to get the current locale encoding with ucrt in Windows:

    #include <locale.h>

    int cp = 0;
    __crt_locale_data_public *locale_data;
    _locale_t locale = _get_current_locale();

    if (locale) {
        locale_data = (__crt_locale_data_public *)locale->locinfo;
        cp = locale_data->_locale_lc_codepage;
        _free_locale(locale);
    }
    if (cp == 0) {
        /* "C" locale. The CRT in effect uses Latin-1 (cp28591),
           but Windows Python prefers the process code page. */
        cp = GetACP();
    }

With ucrt, the C runtime was changed to hide most of the locale definition that was previously public, but it intentionally defines __crt_locale_data_public, so I'm assuming it's there for programs to use. That said, the fact that we have to cast locinfo seems suspect to me. Steve Dower could maybe check with the ucrt devs to ensure that this is supported.

There's also ___lc_codepage() to get the same value more simply, and also more efficiently since the current locale data doesn't have to be copied and freed. However, it's documented as internal and could be removed (unlikely as that is). |
|||
msg389076 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2021-03-19 12:00 | |
On 19.03.2021 12:05, STINNER Victor wrote: > I'm not sure what to do with locale.getdefaultlocale(). Should we deprecate it? I never used this function. How is it used? For which purpose? > > I undertand that in 2000, locale.getdefaultlocale() was interesting to avoid calling setlocale(LC_CTYPE, ""). But Python 3 calls setlocale(LC_CTYPE, "") by default at startup since the early versions, and it's now called on all platforms since Python 3.8. Moreover, its internal database seems to be outdated and is painful to maintain (especially if we consider all platforms supported by Python, not only Linux, there are many issues on macOS). Yes, deprecate it as well. If Python calls setlocale() per default now, it has served its purpose. The alias database is needed by the normalization engine. We may be able to drop the encoding part, but this would have to be checked. |
|||
msg389079 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2021-03-19 12:11 | |
On 19.03.2021 12:26, STINNER Victor wrote: > > STINNER Victor <vstinner@python.org> added the comment: > > Recently, I spent some days to document properly encodings used by Python. Thanks for documenting this. I would prefer to leave the locale module to really just an interface to the lib C locale logic and not add encoding details which are specific to Python's view on I/O (sys or io) or the file system (os). Hopefully, in a few years, we can get rid of all this and standardize on UTF-8 everywhere. |
|||
msg389080 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2021-03-19 12:11 | |
On 19.03.2021 12:35, Eryk Sun wrote: > > Eryk Sun <eryksun@gmail.com> added the comment: > >> Read the ANSI code page on Windows, > > I don't see why the Windows implementation is inconsistent with POSIX here. If it were changed to be consistent, the default encoding at startup would remain the same, since setlocale(LC_CTYPE, "") uses the process code page from GetACP(). I'm not sure I understand what you're saying (but then, I have little experience with locales on Windows). My assumption is that nl_langinfo(CODESET) does not work on Windows or gives wrong results. Is that incorrect ? If it does work, getencoding() could just be a shim over nl_langinfo(CODESET) on all platforms. |
|||
msg389082 - (view) | Author: Eryk Sun (eryksun) * | Date: 2021-03-19 12:25 | |
> If Python calls setlocale() per default now, it has served its purpose. Except not for embedding applications if configure_locale [1] isn't set. But in that case determining the default locale isn't Python's problem to solve. > My assumption is that nl_langinfo(CODESET) does not work on Windows > or gives wrong results. Is that incorrect ? There is no such function for CRT locales. I provided two alternatives that would allow implementing this consistent with POSIX, and thus avoid all of the "except on Windows..." disclaimers that have to explain (apologize) that only the process ANSI code page is used in Windows, and, for no good reason as far as I can tell, the LC_CTYPE locale encoding is completely ignored. --- [1] https://docs.python.org/3/c-api/init_config.html#c.PyPreConfig.configure_locale |
|||
msg389083 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2021-03-19 12:48 | |
On 19.03.2021 13:25, Eryk Sun wrote: >> My assumption is that nl_langinfo(CODESET) does not work on Windows >> or gives wrong results. Is that incorrect ? > > There is no such function for CRT locales. I provided two alternatives that would allow implementing this consistent with POSIX, and thus avoid all of the "except on Windows..." disclaimers that have to explain (apologize) that only the process ANSI code page is used in Windows, and, for no good reason as far as I can tell, the LC_CTYPE locale encoding is completely ignored. Sounds good. If we can get consistent behavior on Windows as well, all the better :-) |
|||
msg389087 - (view) | Author: STINNER Victor (vstinner) * | Date: 2021-03-19 13:27 | |
I created bpo-43557 "Deprecate getdefaultlocale(), getlocale() and normalize() functions". Let's discuss deprecating getdefaultlocale() there. |
|||
msg389088 - (view) | Author: STINNER Victor (vstinner) * | Date: 2021-03-19 13:47 | |
> - If you add "current", people will rightly ask: then what do all the > other APIs in the locale module return ? Of course, they all return > the current state of settings :-) So this is unnecessary as well. The problem is that there are two different "locale encodings", what I call: * "current locale encoding": nl_langinfo(CODESET) in short * "Python locale encoding": "UTF-8" in some cases, nl_langinfo(CODESET) otherwise It is unfortunate that the Python UTF-8 Mode which "ignores the locale" changes the behavior of the locale module, of the locale.getpreferredencoding() function. But the ship has sailed. People are used to look into the "locale" module to get the "locale" encoding. So I prefer to put the function to get the "Python locale encoding" in the locale module. I propose to add "current" in the name since this encoding is not the one you are looking for usually. An alternative is to have a single function with an optional parameter. Example: * get_locale_encoding() or get_locale_encoding(True) returns the locale encoding * get_locale_encoding(False) returns the current locale encoding |
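The two definitions and the single-function alternative described above can be written down as a small pure function. This is only an illustrative sketch: the signature is hypothetical, the environment (UTF-8 Mode, platform, current codeset) is passed in explicitly instead of being read from the interpreter, and `utf8_only_platform` is an invented flag standing for the macOS, Android and VxWorks cases mentioned earlier in this issue.

```python
def get_locale_encoding(python_view=True, *, current_encoding,
                        utf8_mode=False, utf8_only_platform=False):
    """Hypothetical sketch of the single-function alternative.

    python_view=True  -> the "Python locale encoding"
    python_view=False -> the "current locale encoding"
    """
    if python_view and (utf8_mode or utf8_only_platform):
        # Python ignores the locale in these cases.
        return "UTF-8"
    # Otherwise report what nl_langinfo(CODESET) (or the ANSI code
    # page on Windows) would give; here it is supplied by the caller.
    return current_encoding


# Values taken from the fr_FR@euro example earlier in this issue.
print(get_locale_encoding(current_encoding="ISO-8859-15", utf8_mode=True))
print(get_locale_encoding(False, current_encoding="ISO-8859-15", utf8_mode=True))
```

With the UTF-8 Mode disabled and on a platform that does not force UTF-8, both calls return the same value, which is why the difference is easy to overlook.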
|||
msg389089 - (view) | Author: Inada Naoki (methane) * | Date: 2021-03-19 13:57 | |
> I created this issue while reviewing the implementation of the PEP 597: PR 19481.

What I want is the same as `locale.getpreferredencoding(False)`, but ignoring UTF-8 mode.

Background: PEP 597 adds a new `encoding="locale"` option to open() and TextIOWrapper(). It is the same as `encoding=None` for now, but it means using the "locale encoding" explicitly. But this is wrong in UTF-8 mode.

In UTF-8 mode, it's fine for `open(filename)` to use UTF-8. But I want to use the "locale encoding" for `open(filename, encoding="locale")` because the "locale" encoding is specified. I don't want to add a new meaning here. It should be the same as `locale.getpreferredencoding(False)` without UTF-8 mode. So I need "cp%d" % GetACP() on Windows, not the CRT locale encoding.

I don't care about its name. Both sys.locale_encoding() and locale.get_encoding() are OK. |
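As a small illustration of the `encoding="locale"` option mentioned above, the sketch below writes and reads back a temporary file. It assumes Python 3.10 or newer for `encoding="locale"` and falls back to `encoding=None` on older versions. It only shows the option in use; which encoding is actually chosen depends on the platform, the locale and the UTF-8 Mode, which is exactly the question under discussion, so the sketch prints it rather than assuming a value.

```python
import os
import sys
import tempfile

# encoding="locale" is accepted by open() from Python 3.10 on;
# older versions only support encoding=None for the same behavior.
encoding = "locale" if sys.version_info >= (3, 10) else None

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w", encoding=encoding) as f:
        f.write("hello")
        # Name of the encoding that open() actually selected.
        used = f.encoding
    with open(path, encoding=encoding) as f:
        text = f.read()

print("encoding selected by open():", used)
print("text read back:", text)
```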
|||
msg389090 - (view) | Author: STINNER Victor (vstinner) * | Date: 2021-03-19 14:01 | |
> In UTF-8 mode, it's fine to `open(filename)` uses UTF-8. But I want to use "locale encoding" for `open(filename, encoding="locale")` because "locale" encoding is specified. Is it about the current implementation of the PEP 597, or are you thinking at the future Python which would use UTF-8 by default? Currently, getpreferredencoding(False) respects the behavior that you described, no? |
|||
msg389091 - (view) | Author: Inada Naoki (methane) * | Date: 2021-03-19 14:12 | |
> Is it about the current implementation of the PEP 597, or are you thinking at the future Python which would use UTF-8 by default?

I forgot to consider UTF-8 mode while finishing PEP 597. If possible, I want to ignore UTF-8 mode when `encoding="locale"` is specified, starting with Python 3.10. Otherwise, the behavior will change between Python 3.10 and 3.11.

> Currently, getpreferredencoding(False) respects the behavior that you described, no?

getpreferredencoding(False) respects UTF-8 mode. That's what PEP 597 said (because the PEP doesn't define the behavior in UTF-8 mode) and what GH-19481 implements. But it is not what I want for now. I want to ignore UTF-8 mode when `encoding="locale"` is specified.

This is almost a "Windows only" issue, and users can use `encoding="mbcs"` in a Windows-only script. But `encoding="locale"` is the new and recommended way to explicitly request the "locale" encoding. When the user specifies the "locale" encoding explicitly, I think we should respect it regardless of UTF-8 mode. |
|||
msg389093 - (view) | Author: STINNER Victor (vstinner) * | Date: 2021-03-19 14:17 | |
Hum, latest messages are specific to the PEP 597 (implementation). > I had forgot to consider about UTF-8 mode while finishing PEP 597. I propose to continue the discussion about the PEP 597 in bpo-43510. I replied there. I prefer to keep this issue to discuss the locale module. |
|||
msg389098 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2021-03-19 14:56 | |
On 19.03.2021 14:47, STINNER Victor wrote: > > STINNER Victor <vstinner@python.org> added the comment: > >> - If you add "current", people will rightly ask: then what do all the >> other APIs in the locale module return ? Of course, they all return >> the current state of settings :-) So this is unnecessary as well. > > The problem is that there are two different "locale encodings", what I call: > > * "current locale encoding": nl_langinfo(CODESET) in short > * "Python locale encoding": "UTF-8" in some cases, nl_langinfo(CODESET) otherwise The UTF-8 mode is a Python invention. It doesn't have anything to do with the lib C locale functions, which this module addresses and interfaces to. Please don't mix the two. In fact, in order to avoid issues, Python should probably set the locale encoding to UTF-8 as well, when run in UTF-8 mode. It's dangerous to have Python and the lib C use different assumptions about the encoding, esp. in embedded applications. > It is unfortunate that the Python UTF-8 Mode which "ignores the locale" changes the behavior of the locale module, of the locale.getpreferredencoding() function. But the ship has sailed. > > People are used to look into the "locale" module to get the "locale" encoding. So I prefer to put the function to get the "Python locale encoding" in the locale module. > > I propose to add "current" in the name since this encoding is not the one you are looking for usually. > > An alternative is to have a single function with an optional parameter. Example: > > * get_locale_encoding() or get_locale_encoding(True) returns the locale encoding > * get_locale_encoding(False) returns the current locale encoding -1, both on the names and the idea to again add parameters which change their meaning. We should have one function per meaning and really only need the interface getencoding(), since the UTF-8 mode doesn't fit into the locale module scope. |
|||
msg389100 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2021-03-19 15:02 | |
On 19.03.2021 14:57, Inada Naoki wrote: > > Background: PEP 597 adds new `encoding="locale"`option to open() and TextIOWrapper(). It is same to `encoding=None` for now, but it means using "locale encoding" explicitly. > > But this is wrong in UTF-8 mode. Please address UTF-8 mode explicitly in open() or elsewhere. The locale module is about the state of the lib C, not what Python enforces via options in its own I/O layers. As mentioned, both should ideally be synchronized, though, so UTF-8 mode in Python should trigger setting a UTF-8 encoding via setlocale(). |
|||
msg389101 - (view) | Author: Inada Naoki (methane) * | Date: 2021-03-19 15:15 | |
> Please address UTF-8 mode explicitly in open() or elsewhere. The locale > module is about the state of the lib C, not what Python enforces via > options in its own I/O layers. I agree with you. APIs in locale module shouldn't aware UTF-8 mode. `locale.getpreferredencoding()` is special, because it "Return the encoding used for text data, according to user preferences. User preferences are expressed differently on different systems, and might not be available programmatically on some systems, so this function only returns a guess." > As mentioned, both should ideally be synchronized, though, so > UTF-8 mode in Python should trigger setting a UTF-8 encoding > via setlocale(). There is PEP 538 already :) |
|||
msg389102 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2021-03-19 15:22 | |
On 19.03.2021 16:15, Inada Naoki wrote: > > `locale.getpreferredencoding()` is special, because it "Return the encoding used for text data, according to user preferences. User preferences are expressed differently on different systems, and might not be available programmatically on some systems, so this function only returns a guess." I already wrote earlier that we should deprecate this API, since the overloading with different meanings in the past has turned it into an unreliable source of information. At this point, it returns "some encoding, which may or may not be what you want" :-) We need to get things separated out clearly again: the locale module is for the lib C locale state. What Python does in the I/O layers has to be defined and queries at the appropriate places elsewhere (e.g. os, sys or io modules). >> As mentioned, both should ideally be synchronized, though, so >> UTF-8 mode in Python should trigger setting a UTF-8 encoding >> via setlocale(). > > There is PEP 538 already :) Great :-) |
|||
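The separation argued for above (lib C locale state on one side, Python's own I/O-layer choices on the other) can be made visible with a minimal sketch. It uses only existing standard-library calls; the helper name `describe_encodings` and the dictionary keys are illustrative assumptions, not an API proposed in this issue.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python reports, keyed by where they come from.

    `describe_encodings` is a hypothetical helper for illustration only.
    """
    info = {
        # Python-level switch; when enabled it overrides the locale encoding
        # in several places.
        "utf8_mode": bool(sys.flags.utf8_mode),
        # What open() has historically used for encoding=None.
        "getpreferredencoding(False)": locale.getpreferredencoding(False),
        # Encoding used for file names and other OS data.
        "filesystem": sys.getfilesystemencoding(),
        # stdout may have been replaced by an object without .encoding.
        "stdout": getattr(sys.stdout, "encoding", None),
    }
    # nl_langinfo() is the lib C view; it is not available on Windows.
    if hasattr(locale, "nl_langinfo") and hasattr(locale, "CODESET"):
        info["nl_langinfo(CODESET)"] = locale.nl_langinfo(locale.CODESET)
    return info


if __name__ == "__main__":
    for source, value in describe_encodings().items():
        print(f"{source}: {value}")
```

Running it once normally and once with `python -X utf8` is one way to see which values follow the UTF-8 Mode and which follow the C library.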
msg389118 - (view) | Author: Eryk Sun (eryksun) * | Date: 2021-03-19 22:13 | |
> But it is not what I want for now. I want to ignore UTF-8 mode
> when `encoding="locale"` is specified.
> This is almost "only in Windows" issue, and users can use
> `encoding="mbcs"` in Windows-only script.

Why is it being specified that the current LC_CTYPE encoding should be ignored in Windows when a "locale" encoding is requested? Cross-platform C code would use mbstowcs() and wcstombs(), with the current LC_CTYPE encoding. That's Latin-1 in the initial "C" locale and defaults to GetACP() if setlocale(LC_CTYPE, "") is called, but otherwise it's whatever locale is requested by the program and supported by the system (all Windows installations support pretty much every locale). |
|||
msg389131 - (view) | Author: Inada Naoki (methane) * | Date: 2021-03-20 00:36 | |
> Why is it being specified that the current LC_CTYPE encoding should be ignored in Windows when a "locale" encoding is requested?

Because `encoding="locale"` must be a replacement for the current `encoding=None` (i.e. locale.getpreferredencoding(False)). The `encoding=None` behavior will change if we change the default encoding or enable UTF-8 mode by default, so we are adding an explicit name for the current behavior. Assigning another encoding is therefore not an option. See PEP 597 for details.

I know you are proposing to use the CRT locale on Windows. If we change `locale.getpreferredencoding(False)` to use the CRT locale, `encoding="locale"` will follow it. But please discuss it in another issue. |
|||
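For readers unfamiliar with the PEP 597 spelling discussed above, here is a minimal sketch of `encoding="locale"` in use. It assumes Python 3.10 or newer; the helper name `locale_encoded_roundtrip` is hypothetical, and which concrete codec is picked depends on the platform and Python version.

```python
import codecs
import os
import sys
import tempfile


def locale_encoded_roundtrip(text="plain ASCII text"):
    """Write and read `text` with encoding="locale".

    Returns (encoding name reported by the file object, text read back).
    Hypothetical helper for illustration only.
    """
    if sys.version_info < (3, 10):
        raise RuntimeError('encoding="locale" needs Python 3.10 or newer')
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # Explicitly request the locale encoding instead of relying on
        # the encoding=None default.
        with open(path, "w", encoding="locale") as f:
            used = f.encoding  # the codec the file object actually uses
            f.write(text)
        with open(path, "r", encoding="locale") as f:
            return used, f.read()
    finally:
        os.remove(path)


if __name__ == "__main__" and sys.version_info >= (3, 10):
    used, data = locale_encoded_roundtrip()
    print(used, codecs.lookup(used).name, repr(data))
```

The sketch only writes ASCII text so that the round trip does not depend on which locale encoding happens to be active.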
msg389136 - (view) | Author: Eryk Sun (eryksun) * | Date: 2021-03-20 01:37 | |
> But please discuss it in another issue.

What's returned by locale.get_locale_encoding() and locale.get_current_locale_encoding() is relevant to adding them as new functions and is a chance to implement this correctly in Windows.

You're right that what open() does for encoding="locale" is a separate issue, with backwards compatibility problems. I think it was implemented badly and is needlessly inconsistent with POSIX. But we may be stuck with the behavior, considering scripts are within their rights, per documented behavior, to expect that calling setlocale(LC_CTYPE, locale_name) in Windows has no effect on the result of locale.getpreferredencoding(False), unlike POSIX generally, except for some platforms such as macOS and Android. |
|||
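The platform difference described above can be observed with a small sketch that switches LC_CTYPE temporarily and reports what each API returns. The helper name `encodings_under` is an assumption made for illustration; the set of accepted locale names is platform-specific, and because setlocale() changes process-wide state the helper is not safe to call while other threads depend on the locale.

```python
import locale


def encodings_under(locale_name):
    """Temporarily set LC_CTYPE to `locale_name` and report both views.

    Returns (nl_langinfo(CODESET) or None, getpreferredencoding(False)).
    Hypothetical helper for illustration only; not thread-safe.
    """
    # With no second argument, setlocale() only queries the current setting.
    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
        codeset = None
        # nl_langinfo() does not exist on Windows.
        if hasattr(locale, "nl_langinfo") and hasattr(locale, "CODESET"):
            codeset = locale.nl_langinfo(locale.CODESET)
        return codeset, locale.getpreferredencoding(False)
    finally:
        # Always restore the previous LC_CTYPE setting.
        locale.setlocale(locale.LC_CTYPE, saved)


if __name__ == "__main__":
    print(encodings_under("C"))
```

Comparing the output for `"C"` with the output for another locale installed on the system shows whether the second value tracks setlocale() on that platform.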
msg389159 - (view) | Author: STINNER Victor (vstinner) * | Date: 2021-03-20 14:45 | |
Python has used GetACP(), the ANSI code page of the operating system, for years. What is the advantage of using a different encoding? In my experience, most applications use the ANSI code page because they use the ANSI flavor of the Windows API. What is the use case for using ___lc_codepage()? Is it a different encoding? |
|||
msg389195 - (view) | Author: Eryk Sun (eryksun) * | Date: 2021-03-20 23:05 | |
> In my experience, most applications use the ANSI code page because
> they use the ANSI flavor of the Windows API.

The default encoding at startup and in the "C" locale wouldn't change. It would only differ from the default if setlocale(LC_CTYPE, locale_name) sets it otherwise. The suggestion is to match the behavior of nl_langinfo(CODESET) in Linux and many other POSIX systems.

When I say the default encoding won't change, I mean that the Universal C Runtime (ucrt) system component uses the process ANSI code page as the default locale encoding for setlocale(LC_CTYPE, ""). This agrees with what Python has always done, but it disagrees with previous versions of the CRT in Windows. Personally, I think it's a misstep because the user locale isn't necessarily compatible with the process code page, but I'm not looking to change this decision.

For example, if the user locale is "el_GR" (Greek, Greece) but the process code page is 1252 (Latin) instead of 1253 (Greek), I get the following result in Python 3.4 (VC++ 10) vs Python 3.5 (ucrt):

    >py -3.4 -c "from locale import *; print(setlocale(LC_CTYPE, ''))"
    Greek_Greece.1253

    >py -3.5 -c "from locale import *; print(setlocale(LC_CTYPE, ''))"
    Greek_Greece.1252

The result from VC++ 10 is consistent with the user locale. It's also consistent with multilingual user interface (MUI) text, such as error messages, or at least it should be, because the user locale and user preferred language (i.e. Windows display language) should be consistent. (The control panel dialog to set the user locale in Windows 10 has an option to match the display language, which is the recommended and default setting.)

For example, Python uses system error messages that are localized to the user's preferred language:

    >py -c "import os; os.stat('spam')"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    FileNotFoundError: [WinError 2] Δεν είναι δυνατή η εύρεση του καθορισμένου αρχείου από το σύστημα: 'spam'

This example is on a system where the process (system) ANSI code page is 1252 (Latin), which cannot encode the user's preferred Greek text. Thankfully Python 3.6+ uses the console's Unicode API, so neither the console session's output code page nor the process code page gets in the way. On the other hand, if this Greek text is written to a file or piped to a child process using subprocess.Popen(), Python's choice of locale encoding based on the process code page (Latin) is incompatible with Greek text, and thus it's incompatible with the current user's preferred locale and language settings.

The process ANSI code page from GetACP() has its uses, which are important. It's a system setting that's independent of the current user locale and thus useful when interacting with the legacy system API and as a common encoding for inter-process data exchange when applications do not use Unicode and may be operating in different locales. So if you're writing to a legacy-encoded text file that's shared by multiple users or piping text to an arbitrary program, then using the ANSI code page is probably okay. Though, especially for IPC, there's a good chance that it's wrong, since Windows has never set, let alone enforced, a standard in that case. Using the process ANSI code page in the "C" locale makes sense to me.

> What is the use case for using ___lc_codepage()? Is it a different
> encoding?

I always forget the "_func" suffix in the name; it's ___lc_codepage_func() [1]. The lc_codepage value is the current LC_CTYPE codeset as an integer code page. It's the equivalent of nl_langinfo(CODESET) in POSIX. For UTF-8, the code page is CP_UTF8 (65001), but this gets displayed in locale strings as "UTF-8" (or variants such as "utf8"). It could be the LC_CTYPE encoding of just the current thread, but Python does not enable per-thread locales.

The CRT has exported ___lc_codepage_func() since VC++ 7.0 (2002), and before that the current lc_codepage value itself was directly exported as __lc_codepage. However, this triple-dundered function is documented as internal and not recommended for use. That's why the code snippet I showed uses _get_current_locale() with locinfo cast to __crt_locale_data_public *. This takes "public" in the struct name at face value. Anything that's declared public should be safe to use, but the locale_t type is frustratingly undocumented even for this public data [2].

If neither approach is supported, locale.get_current_locale_encoding() could instead parse the current locale encoding from setlocale(LC_CTYPE, NULL). The resulting locale string usually includes the codeset (e.g. "Greek_Greece.1253"). The exceptions are the "C" locale and BCP-47 (RFC 5646) locales that do not explicitly use UTF-8 (e.g. "el_GR" or "el" instead of "el_GR.UTF-8"), but these cases can be handled reliably.

---
[1] https://docs.microsoft.com/en-us/cpp/c-runtime-library/lc-codepage-func
[2] https://docs.microsoft.com/en-us/cpp/c-runtime-library/locale |
|||
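The fallback mentioned at the end of the message above, parsing the codeset out of the string returned by setlocale(LC_CTYPE, NULL), might look roughly like the following sketch. The function names, the `"cp"` prefix for numeric code pages, and the choice to return `None` for the "C" and codeset-less cases are all assumptions made for illustration; the thread does not specify how those exceptions would be resolved.

```python
import locale


def codeset_from_locale_string(name):
    """Return the codeset part of a locale string, or None if it has none.

    Hypothetical helper: "Greek_Greece.1253" yields "cp1253", while "C"
    and "el_GR" carry no codeset and yield None for the caller to handle.
    """
    # Drop a POSIX modifier such as "@euro" before looking for the codeset.
    base = name.split("@", 1)[0]
    _, dot, codeset = base.partition(".")
    if not dot or not codeset:
        return None
    if codeset.isdigit():
        # Windows locale strings carry a numeric code page.
        return "cp" + codeset
    if codeset.lower().replace("-", "") == "utf8":
        # Normalize variants such as "UTF-8" and "utf8".
        return "utf-8"
    return codeset


def current_ctype_codeset():
    """Parse the codeset from the current LC_CTYPE setting (query only)."""
    return codeset_from_locale_string(locale.setlocale(locale.LC_CTYPE))


if __name__ == "__main__":
    print(current_ctype_codeset())
```

A caller would still need a policy for the `None` result, for instance falling back to the process ANSI code page on Windows.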
msg396686 - (view) | Author: STINNER Victor (vstinner) * | Date: 2021-06-29 00:19 | |
PEP 597 was implemented successfully in Python 3.10 with this feature. There is no agreement yet on what the "current locale encoding" is. For now, I prefer to close the issue. We can reconsider this feature once there are more user requests for such a function and there is agreement on what the "current locale encoding" is. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:59:43 | admin | set | github: 87718 |
2021-06-29 00:19:02 | vstinner | set | status: open -> closed resolution: rejected messages: + msg396686 stage: patch review -> resolved |
2021-03-20 23:05:25 | eryksun | set | messages: + msg389195 |
2021-03-20 14:45:16 | vstinner | set | messages: + msg389159 |
2021-03-20 01:37:15 | eryksun | set | messages: + msg389136 |
2021-03-20 00:36:21 | methane | set | messages: + msg389131 |
2021-03-19 22:13:13 | eryksun | set | messages: + msg389118 |
2021-03-19 15:22:29 | lemburg | set | messages: + msg389102 |
2021-03-19 15:15:12 | methane | set | messages: + msg389101 |
2021-03-19 15:02:41 | lemburg | set | messages: + msg389100 |
2021-03-19 14:56:53 | lemburg | set | messages: + msg389098 |
2021-03-19 14:17:18 | vstinner | set | messages: + msg389093 |
2021-03-19 14:12:25 | methane | set | messages: + msg389091 |
2021-03-19 14:01:13 | vstinner | set | messages: + msg389090 |
2021-03-19 13:57:29 | methane | set | messages: + msg389089 |
2021-03-19 13:47:36 | vstinner | set | messages: + msg389088 |
2021-03-19 13:27:51 | vstinner | set | messages: + msg389087 |
2021-03-19 12:48:02 | lemburg | set | messages: + msg389083 |
2021-03-19 12:25:03 | eryksun | set | messages: + msg389082 |
2021-03-19 12:11:37 | lemburg | set | messages: + msg389080 |
2021-03-19 12:11:27 | lemburg | set | messages: + msg389079 |
2021-03-19 12:00:06 | lemburg | set | messages: + msg389076 |
2021-03-19 11:35:25 | eryksun | set | nosy: + eryksun messages: + msg389074 |
2021-03-19 11:34:54 | vstinner | set | nosy: + methane |
2021-03-19 11:26:39 | vstinner | set | messages: + msg389072 |
2021-03-19 11:11:34 | vstinner | set | messages: + msg389070 |
2021-03-19 11:05:29 | vstinner | set | files: + encodings.py messages: + msg389069 |
2021-03-19 10:59:43 | lemburg | set | messages: + msg389068 |
2021-03-19 10:36:11 | vstinner | set | messages: + msg389066 |
2021-03-19 10:31:12 | vstinner | set | messages: + msg389065 |
2021-03-19 10:22:54 | vstinner | set | messages: + msg389064 |
2021-03-19 10:17:35 | vstinner | set | messages: + msg389063 |
2021-03-19 10:15:58 | vstinner | set | keywords: + patch stage: patch review pull_requests: + pull_request23693 |
2021-03-19 09:54:45 | lemburg | set | nosy: + lemburg messages: + msg389062 title: Add locale.get_locale_encoding() and locale.get_current_locale_encoding() -> Add locale.get_locale_encoding() and locale.get_current_locale_encoding() |
2021-03-19 09:17:13 | vstinner | create |