This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author lemburg
Recipients lemburg, vstinner
Date 2021-03-19.09:54:44
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <e66bcdba-3494-ad8c-8170-11144644e34f@egenix.com>
In-reply-to <1616145433.55.0.350148222552.issue43552@roundup.psfhosted.org>
Content
On 19.03.2021 10:17, STINNER Victor wrote:
> 
> New submission from STINNER Victor <vstinner@python.org>:
> 
> I propose to add two new functions:
> 
> * locale.get_locale_encoding(): it's exactly the same than locale.getpreferredencoding(False).
> 
> * locale.get_current_locale_encoding(): always get the current locale encoding. Read the ANSI code page on Windows, or nl_langinfo(CODESET) on other platforms. Ignore the UTF-8 Mode. Don't always return "UTF-8" on macOS, Android, VxWorks.

I'm not sure whether this would improve the situation much.

The problem is that the locale module is meant to expose the lib C
locale settings, but many of the recent additions actually do something
completely different: they look into the process and user environment
and try to determine external settings, which are not reflected in
the lib C locale settings.

I had added locale.getdefaultlocale() to give applications a chance
to determine the locale setting defined by the process environment
*without* calling setlocale(LC_ALL, '') and causing problems
in other threads. I used the X11 database for locale encodings,
which was the closest you could get to in terms of a standard for
encodings at the time (around 2000).

Part of the return value is the encoding, which would be set.

Martin later added locale.getpreferredencoding(), which tries to
determine the encoding in a different way new way, based on
nl_langset(CODEINFO). As you mentioned, this intention was broken
on several platforms by forcing UTF-8 as output. And in many cases,
the API had to call setlocale() as well, causing the thread problems.

However, the problem with nl_langset(CODEINFO) is the same as
with setlocale(): it returns the current state of the lib C
settings, which may well point to the 'C' locale. Not the ones
the user has configured in the OS environment. So while you get
an encoding defined by lib C for the current locale settings
(without guessing it as with locale.getdefaultlocale()), you
still don't get what the user really wants to use.

Unfortunately, lib C does not provide a way to query the locale
database without changing the locale settings at the same time.
This is the main issue we're facing.

Now, the correct way in all this would be to just call
setlocale(LC_ALL, '') at the start of the application and
not try to apply all the magic to get around this. But this
has to be done by the application and not Python (which may
well be embedded into some other application).

I'd suggest to add a single new API:

locale.getencoding()

which interfaces to nl_langinfo(CODESET) or the Windows code
page and does not try to do any magic, ie. does *not* call
setlocale(). It needs to return what the lib C currently
knows and uses as encoding.

locale.getpreferredencoding() should then be deprecated.

It does not make sense to pretend to query information which is
not really directly available from the lib C locale system.

And the documentation should point out that applications should
call setlocale(LC_ALL, '') when they start up, if they want to
get the lib C locale, and thus Python locale module, setup to
work based on what the user really wants -- instead of just
guessing at this.

PS: The locale module normally does not use underscores in
function names, so it's not a good idea to add more.
History
Date User Action Args
2021-03-19 09:54:45lemburgsetrecipients: + lemburg, vstinner
2021-03-19 09:54:45lemburglinkissue43552 messages
2021-03-19 09:54:44lemburgcreate