Message 389063 - Python tracker


Author	vstinner
Recipients	lemburg, vstinner
Date	2021-03-19.10:17:35
Message-id	<1616149055.42.0.727843855431.issue43552@roundup.psfhosted.org>

Content

I created this issue while reviewing the implementation of PEP 597: PR 19481.

Below is a copy of my comments on the PR that relate to this issue.


_locale.get_locale_encoding() calls _Py_GetLocaleEncoding() which returns UTF-8 if the Python UTF-8 Mode is enabled.

Maybe the function could have a flag: please don't lie to me and return the current locale encoding ;-)

Or we could add a function to get the *current* locale encoding: **locale.get_current_locale_encoding()**.

This one would ignore the UTF-8 Mode and call nl_langinfo(CODESET). There are APIs to use the *current* locale encoding: PyUnicode_EncodeLocale/PyUnicode_DecodeLocale and _Py_EncodeLocaleEx/_Py_DecodeLocaleEx with current_locale=1. You can see which functions use it:

* decode tm_zone field of localtime_r() and gmtime()
* decode tzname[0] and tzname[1] strings
* decode setlocale() result
* decode some localeconv() fields (this function requires switching to a different locale encoding, which is bad!)
* decode nl_langinfo() result
* decode gettext(), dgettext(), dcgettext(), textdomain(), bindtextdomain(), bind_textdomain_codeset() result
* decode strerror() and dlerror() result
* encode/decode in the readline module
* encode the format string for strftime() in time.strftime() (only used on Windows; Unix provides wcsftime()), then decode the strftime() result
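To make the proposal concrete, here is a minimal sketch of what a function like locale.get_current_locale_encoding() could do from pure Python. The helper name is hypothetical (it only mirrors the proposal above), and the sketch assumes that locale.nl_langinfo(locale.CODESET) is a thin wrapper around the C call that does not consult the UTF-8 Mode. Where nl_langinfo() is missing (for example on Windows), the sketch simply returns None rather than guessing.

```python
import locale
import sys


def get_current_locale_encoding():
    """Hypothetical helper: codeset of the *current* LC_CTYPE locale.

    Asks the C library through nl_langinfo(CODESET) and does not look at
    the UTF-8 Mode. Returns None where nl_langinfo() is unavailable.
    """
    nl_langinfo = getattr(locale, "nl_langinfo", None)
    codeset = getattr(locale, "CODESET", None)
    if nl_langinfo is None or codeset is None:
        return None
    return nl_langinfo(codeset)


if __name__ == "__main__":
    # With "python -X utf8", the first two lines report UTF-8 while the
    # last one may still report the codeset of the LC_CTYPE locale.
    print("UTF-8 Mode enabled:", bool(sys.flags.utf8_mode))
    print("getpreferredencoding(False):", locale.getpreferredencoding(False))
    print("nl_langinfo(CODESET):", get_current_locale_encoding())
```

Running the script once normally and once with -X utf8 is a quick way to compare the two notions of "locale encoding" on a given platform.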


> encoding="locale" : Uses locale encoding regardless UTF-8 mode.

Currently, open(encoding=None) doesn't work like that. For example, on macOS, Android and VxWorks, it always uses UTF-8. And if the UTF-8 Mode is enabled, UTF-8 is used.
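The behavior described here can be observed directly. The following is a small sketch, not part of the PR: default_open_encoding() is a hypothetical helper that opens os.devnull in text mode and reports the encoding open() selected for encoding=None.

```python
import codecs
import os
import sys


def default_open_encoding():
    """Return the encoding open() selects when encoding=None."""
    # os.devnull avoids creating a real file; nothing is written to it.
    with open(os.devnull, "w", encoding=None) as stream:
        return stream.encoding


if __name__ == "__main__":
    encoding = default_open_encoding()
    print("UTF-8 Mode enabled:", bool(sys.flags.utf8_mode))
    print("open(encoding=None) uses:", encoding)
    # Normalize the name, since platforms spell it differently.
    print("normalized codec name:", codecs.lookup(encoding).name)
```

With the UTF-8 Mode enabled (-X utf8 or PYTHONUTF8=1), the reported codec should be UTF-8 whatever the LC_CTYPE locale is.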

In PEP 597, I read that encoding="locale" is the same as encoding=None but doesn't emit an EncodingWarning. Where does PEP 597 change the chosen encoding for the encoding=None case? The PEP says "locale encoding" without specifying exactly what it is. In Python, it means different things depending on the context. There is a subtle difference between the **current** locale encoding and "the locale encoding". I agree that it needs some clarification :-)

While we are discussing encodings: I never understood why open() gets the current locale encoding from nl_langinfo(CODESET), an encoding which can change at runtime while Python is running. For example, if thread A calls open(filename, encoding=None), thread B calls locale.localeconv(), and the LC_MONETARY locale uses a different encoding than the LC_CTYPE locale, thread A can get the LC_MONETARY encoding because of how locale.localeconv() is currently implemented: it temporarily changes LC_CTYPE to the LC_MONETARY locale to decode the monetary fields of the localeconv() result.
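The race depends on one fact: the value of nl_langinfo(CODESET) follows LC_CTYPE at runtime. The sketch below shows that in a single thread; codeset_under() is a hypothetical helper, and the locale names passed to it are only examples that may not be installed on every system. Between the setlocale() call and the restore in the finally block, any other thread asking for the current locale encoding would see the temporary value, which is the window described above.

```python
import locale


def codeset_under(locale_name):
    """Return nl_langinfo(CODESET) while LC_CTYPE is set to locale_name.

    Returns None if the locale is not installed or nl_langinfo() is not
    available on this platform. LC_CTYPE is always restored afterwards.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, no change
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
        # Window in which other threads observe the temporary LC_CTYPE.
        return locale.nl_langinfo(locale.CODESET)
    except (locale.Error, AttributeError):
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


if __name__ == "__main__":
    # Example locale names; availability depends on the system.
    for name in ("C", "en_US.UTF-8", "fr_FR.ISO8859-1"):
        print(name, "->", codeset_under(name))
```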

I would prefer that Python use the same encoding for the whole lifetime of the process, from beginning to end. The Python filesystem encoding is a good choice for that. It's the same as locale.getpreferredencoding(False) (currently used by open() and friends), but the two diverge if LC_CTYPE is changed (temporarily or permanently).
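A short sketch of that difference: sys.getfilesystemencoding() is chosen at startup and stays fixed, whereas locale.getpreferredencoding(False) may follow LC_CTYPE. The helper names encodings_snapshot() and snapshot_under() are hypothetical, and whether the "preferred" value actually changes depends on the platform and on the UTF-8 Mode.

```python
import locale
import sys


def encodings_snapshot():
    """Report the filesystem encoding and the preferred encoding."""
    return {
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(False),
    }


def snapshot_under(locale_name):
    """Take a snapshot while LC_CTYPE is temporarily set to locale_name.

    Returns None if the locale cannot be set. LC_CTYPE is restored.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, no change
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
        return encodings_snapshot()
    except locale.Error:
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


if __name__ == "__main__":
    print("at startup:     ", encodings_snapshot())
    print("with LC_CTYPE=C:", snapshot_under("C"))
```

The "filesystem" entry is expected to be identical in both lines, which is the stability argued for above.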

History
Date	User	Action	Args
2021-03-19 10:17:35	vstinner	set	recipients: + vstinner, lemburg
2021-03-19 10:17:35	vstinner	set	messageid: <1616149055.42.0.727843855431.issue43552@roundup.psfhosted.org>
2021-03-19 10:17:35	vstinner	link	issue43552 messages
2021-03-19 10:17:35	vstinner	create