Author lemburg
Recipients Arfrever, benjamin.peterson, lemburg, loewis, serhiy.storchaka
Date 2017-03-10.15:29:13
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <816bda54-4026-5da6-c4c3-6e4b168b99df@egenix.com>
In-reply-to <1489131424.74.0.828824726275.issue20087@psf.upfronthosting.co.za>
Content
On 10.03.2017 08:37, Benjamin Peterson wrote:
> 
> Do you believe this program should work?
> 
> import locale, os
> for l in open("/usr/share/i18n/SUPPORTED"):
>     alias, encoding = l.strip().split()
>     locale.setlocale(locale.LC_ALL, alias)
>     try:
>         enc = locale.getlocale()[1]
>     except ValueError:
>         continue # not in table
>     normalized = enc.replace("ISO", "ISO-"). \
>                      replace("_", "-"). \
>                      replace("euc", "EUC-"). \
>                      replace("big5", "big5-").upper()
>     assert normalized == locale.nl_langinfo(locale.CODESET)
> 
> After my change it does—the encoding returned from getlocale() is the one actually being used by glibc. It fails dramatically on earlier versions of Python (for example on the en_IN example from #29571.) I don't understand why Python needs to editorialize whatever choices libc or the system administrator has made.

Your program essentially tests what alias is configured
on your particular system. It will fail on older systems
(with a different or no version of SUPPORTED), it will fail on
systems that do not have all locales installed, it will
fail on systems that use the X.org aliases table as basis
rather than some list of supported locales of glibc, or
custom alias tables.

What we want in Python is a consistent mapping of aliases to locales
across all (Unix based) Python installations, just like what we
have for encoding aliases and those mappings should be taken
from a support alias database, not a list of default installations
on some glibc version.

Also note that a lot of these discussions are really academic,
since locales should always be specified with encoding.

While Unix gravitates to UTF-8 for all system related things,
users still use other encodings a lot for their daily operations,
as you can see in the X.org aliases file.

This is why defaulting to UTF-8 for locales (as e.g.
is done for many locales in the glibc default installs) is not
a good idea. Locales affect user work products. What's fine for
command line interfacing or piping, is not necessarily for
fine for e.g. documents created by users.

So to answer your question: No, I don't believe that SUPPORTED
has any authority for our purposes and thus don't think that
the program can be considered a valid test case.

The SUPPORTED file can server as extra resource for fixing bugs
in the table, but nothing more.

> Is getlocale() expected to return something different from the underlying C locale?

getlocale() will return whatever is currently configured via
setlocale().

Of course, it can return something different from what some glibc
SUPPORTED lists as default installation encoding, if you don't provide
the encoding when using setlocale(), but it will always default
to the same locale and encoding on all platforms where you
run Python.

> In fact, why have this table at all instead of using nl_langinfo to return the encoding for the current locale?

The table is meant to normalize locale names and enrich
them with default encodings from a well known database of
such aliases, where necessary. As mentioned above the locale setting
should ideally include the encoding as well, so that any such
guesses are not necessary.

Regarding nl_langinfo():

nl_langinfo() will only work if you have called
setlocale() already, since a process always starts up in
the C locale without this call.

If you don't have a problem with calling setlocale() for
testing the default locale settings (e.g. Python is not
embedded, you don't have other threads running, no
APIs which use locale information called yet, setlocale()
was already called to setup the locale, etc.),
you can use the approach taken by getpreferredencoding(),
which is to temporarily set the locale to the default.

Going forward, I think that the following changes make
sense:

* from ISO8859-1 to ISO8859-15 (the -15 version adds
  the Euro sign)

* casing changes e.g. 'zh_CN.gb2312' to 'zh_CN.GB2312'

* fixes which undo removal of modifiers such as
  'uz_uz@cyrillic' -> 'uz_UZ.UTF-8' to 'uz_UZ.UTF-8@cyrillic'

As for the other changes: please undo them and also
revert the unconditional use of glibc mappings overriding
the X.org ones, as mentioned earlier in the thread.

We can readd some of the modifications later on if there's
evidence that they actually do make sense.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com
History
Date User Action Args
2017-03-10 15:29:14lemburgsetrecipients: + lemburg, loewis, benjamin.peterson, Arfrever, serhiy.storchaka
2017-03-10 15:29:14lemburglinkissue20087 messages
2017-03-10 15:29:13lemburgcreate