Message 389062 - Python tracker

Author	lemburg
Recipients	lemburg, vstinner
Date	2021-03-19.09:54:44
Message-id	<e66bcdba-3494-ad8c-8170-11144644e34f@egenix.com>
In-reply-to	<1616145433.55.0.350148222552.issue43552@roundup.psfhosted.org>

Content

On 19.03.2021 10:17, STINNER Victor wrote:
> 
> New submission from STINNER Victor <vstinner@python.org>:
> 
> I propose to add two new functions:
> 
> * locale.get_locale_encoding(): it's exactly the same than locale.getpreferredencoding(False).
> 
> * locale.get_current_locale_encoding(): always get the current locale encoding. Read the ANSI code page on Windows, or nl_langinfo(CODESET) on other platforms. Ignore the UTF-8 Mode. Don't always return "UTF-8" on macOS, Android, VxWorks.

I'm not sure whether this would improve the situation much.

The problem is that the locale module is meant to expose the lib C
locale settings, but many of the recent additions actually do something
completely different: they look into the process and user environment
and try to determine external settings, which are not reflected in
the lib C locale settings.

I had added locale.getdefaultlocale() to give applications a chance
to determine the locale setting defined by the process environment
*without* calling setlocale(LC_ALL, '') and causing problems
in other threads. I used the X11 database for locale encodings,
which was the closest you could get to in terms of a standard for
encodings at the time (around 2000).

Part of the return value is the encoding which would be set by
such a call.

Martin later added locale.getpreferredencoding(), which tries to
determine the encoding in a different, new way, based on
nl_langinfo(CODESET). As you mentioned, this intention was broken
on several platforms by forcing UTF-8 as output. And in many cases,
the API had to call setlocale() as well, causing the thread problems.

However, the problem with nl_langinfo(CODESET) is the same as
with setlocale(): it returns the current state of the lib C
settings, which may well point to the 'C' locale rather than the
settings the user has configured in the OS environment. So while you get
an encoding defined by lib C for the current locale settings
(without guessing it as with locale.getdefaultlocale()), you
still don't get what the user really wants to use.
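
To illustrate the point, here is a minimal sketch (not part of the
proposal) which queries nl_langinfo(CODESET) once under the 'C' locale
and once after adopting the environment's settings. The helper names
current_codeset() and codeset_in_c_and_user_locale() are made up for
this example, and it assumes a platform where locale.nl_langinfo() is
available, returning None otherwise.

```python
import locale


def current_codeset():
    """Return the encoding lib C currently reports for LC_CTYPE, or None."""
    # nl_langinfo() and CODESET do not exist on every platform (e.g. Windows).
    if hasattr(locale, "nl_langinfo") and hasattr(locale, "CODESET"):
        return locale.nl_langinfo(locale.CODESET)
    return None


def codeset_in_c_and_user_locale():
    """Return (codeset under 'C', codeset under the environment's locale)."""
    # Calling setlocale() without a second argument only queries the setting.
    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        locale.setlocale(locale.LC_CTYPE, "C")
        in_c = current_codeset()
        try:
            # '' asks lib C to pick up the locale from the environment.
            locale.setlocale(locale.LC_CTYPE, "")
            in_user = current_codeset()
        except locale.Error:
            # The environment names a locale lib C does not know.
            in_user = None
        return in_c, in_user
    finally:
        # Restore whatever was active before, since the change is process-wide.
        locale.setlocale(locale.LC_CTYPE, saved)


if __name__ == "__main__":
    print(codeset_in_c_and_user_locale())
```

The two values can differ on systems where the user has configured a
non-'C' locale, which is the mismatch described above. Note that the
sketch itself has to call setlocale() to show this, with the usual
caveat about other threads.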

Unfortunately, lib C does not provide a way to query the locale
database without changing the locale settings at the same time.
This is the main issue we're facing.

Now, the correct way in all this would be to just call
setlocale(LC_ALL, '') at the start of the application and
not try to apply all the magic to get around this. But this
has to be done by the application and not Python (which may
well be embedded into some other application).
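
As a rough sketch of what such an application start-up call could look
like in Python (the function name init_locale_from_environment() is
only an illustration, not an existing API):

```python
import locale


def init_locale_from_environment():
    """Adopt the user's locale settings; return the new setting or None."""
    try:
        # '' tells lib C to read the locale from the process environment.
        return locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment refers to a locale which is not installed;
        # lib C keeps whatever locale was active before the call.
        return None


if __name__ == "__main__":
    # Call this once, early, before other threads are started.
    print(init_locale_from_environment())
```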

I'd suggest adding a single new API:

locale.getencoding()

which interfaces to nl_langinfo(CODESET) or the Windows code
page and does not try to do any magic, i.e. does *not* call
setlocale(). It needs to return what the lib C currently
knows and uses as encoding.
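
A minimal sketch of how such a function might behave, written in pure
Python for illustration only. The name getencoding_sketch() is
hypothetical, reading the Windows ANSI code page via ctypes and
GetACP() is an assumption about one possible implementation, and the
sketch assumes locale.nl_langinfo() exists on the non-Windows platform:

```python
import locale
import sys


def getencoding_sketch():
    """Return the encoding lib C currently uses, without calling setlocale()."""
    if sys.platform == "win32":
        # Assumption: use the ANSI code page reported by the Win32 API.
        import ctypes
        return "cp%d" % ctypes.windll.kernel32.GetACP()
    # Report the current lib C state as-is; no guessing, no UTF-8 override.
    return locale.nl_langinfo(locale.CODESET)


if __name__ == "__main__":
    print(getencoding_sketch())
```

Because the function never changes the locale, its result depends
entirely on whether the application has called setlocale() beforehand.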

locale.getpreferredencoding() should then be deprecated.

It does not make sense to pretend to query information which is
not really directly available from the lib C locale system.

And the documentation should point out that applications should
call setlocale(LC_ALL, '') when they start up, if they want to
get the lib C locale, and thus the Python locale module, set up to
work based on what the user really wants -- instead of just
guessing at this.

PS: The locale module normally does not use underscores in
function names, so it's not a good idea to add more.

History
Date	User	Action	Args
2021-03-19 09:54:45	lemburg	set	recipients: + lemburg, vstinner
2021-03-19 09:54:45	lemburg	link	issue43552 messages
2021-03-19 09:54:44	lemburg	create