Message 289377 - Python tracker


Author	lemburg
Recipients	Arfrever, benjamin.peterson, lemburg, loewis, serhiy.storchaka
Date	2017-03-10.15:29:13
Message-id	<816bda54-4026-5da6-c4c3-6e4b168b99df@egenix.com>
In-reply-to	<1489131424.74.0.828824726275.issue20087@psf.upfronthosting.co.za>

Content

On 10.03.2017 08:37, Benjamin Peterson wrote:
> 
> Do you believe this program should work?
> 
> import locale, os
> for l in open("/usr/share/i18n/SUPPORTED"):
>     alias, encoding = l.strip().split()
>     locale.setlocale(locale.LC_ALL, alias)
>     try:
>         enc = locale.getlocale()[1]
>     except ValueError:
>         continue # not in table
>     normalized = enc.replace("ISO", "ISO-"). \
>                      replace("_", "-"). \
>                      replace("euc", "EUC-"). \
>                      replace("big5", "big5-").upper()
>     assert normalized == locale.nl_langinfo(locale.CODESET)
> 
> After my change it does—the encoding returned from getlocale() is the one actually being used by glibc. It fails dramatically on earlier versions of Python (for example on the en_IN example from #29571.) I don't understand why Python needs to editorialize whatever choices libc or the system administrator has made.

Your program essentially tests what alias is configured
on your particular system. It will fail on older systems
(with a different or no version of SUPPORTED), it will fail on
systems that do not have all locales installed, it will
fail on systems that use the X.org aliases table as basis
rather than some list of supported locales of glibc, or
custom alias tables.
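
To illustrate the point about locales that are not installed, here is a minimal sketch (not part of the original message) that probes which names setlocale() accepts on the current system. The helper name installed_locales and the candidate list are made up for illustration; the results depend on the system.

```python
import locale

def installed_locales(candidates):
    """Return the candidate locale names that setlocale() accepts here."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, changes nothing
    available = []
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # typically: locale not installed on this system
            available.append(name)
    finally:
        # Always restore the setting the process had before probing.
        locale.setlocale(locale.LC_CTYPE, saved)
    return available

# Hypothetical candidates; which ones succeed differs between systems.
print(installed_locales(["C", "en_US.UTF-8", "de_DE.ISO8859-15"]))
```

A loop over SUPPORTED without such a guard stops with locale.Error at the first entry that is not installed.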

What we want in Python is a consistent mapping of aliases to locales
across all (Unix based) Python installations, just like what we
have for encoding aliases. Those mappings should be taken
from a supported alias database, not a list of default installations
on some glibc version.

Also note that a lot of these discussions are really academic,
since locales should always be specified with encoding.

While Unix gravitates to UTF-8 for all system related things,
users still use other encodings a lot for their daily operations,
as you can see in the X.org aliases file.

This is why defaulting to UTF-8 for locales (as e.g.
is done for many locales in the glibc default installs) is not
a good idea. Locales affect user work products. What's fine for
command line interfacing or piping is not necessarily
fine for e.g. documents created by users.

So to answer your question: No, I don't believe that SUPPORTED
has any authority for our purposes and thus don't think that
the program can be considered a valid test case.

The SUPPORTED file can serve as an extra resource for fixing bugs
in the table, but nothing more.

> Is getlocale() expected to return something different from the underlying C locale?

getlocale() will return whatever is currently configured via
setlocale().

Of course, it can return something different from what some glibc
SUPPORTED lists as default installation encoding, if you don't provide
the encoding when using setlocale(), but it will always default
to the same locale and encoding on all platforms where you
run Python.
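
The difference between the two answers can be made visible with a small sketch (an illustration, not code from this thread). The helper name table_vs_libc is hypothetical; nl_langinfo() is only available on Unix-like platforms, and the values returned depend on the Python version and the installed locales.

```python
import locale

def table_vs_libc(name):
    """Return (encoding per getlocale(), codeset per nl_langinfo()) for name.

    Returns None if the C library does not accept the locale name.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query only
    try:
        try:
            locale.setlocale(locale.LC_CTYPE, name)
        except locale.Error:
            return None  # locale not available on this system
        try:
            # Derived from the locale name, using Python's alias table.
            table_encoding = locale.getlocale(locale.LC_CTYPE)[1]
        except ValueError:
            table_encoding = None  # name could not be parsed
        # Reported by the C library for the locale actually in effect.
        return table_encoding, locale.nl_langinfo(locale.CODESET)
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)

# A name without an encoding is where the two answers may differ.
print(table_vs_libc("en_US"))
print(table_vs_libc("en_US.UTF-8"))
```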

> In fact, why have this table at all instead of using nl_langinfo to return the encoding for the current locale?

The table is meant to normalize locale names and enrich
them with default encodings from a well-known database of
such aliases, where necessary. As mentioned above, the locale setting
should ideally include the encoding as well, so that any such
guesses are not necessary.
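
As a brief illustration of what the table does (a sketch only), locale.normalize() can be called directly. The sample names are arbitrary, and no expected output is shown because the result for names without an encoding depends on the alias table shipped with the Python version in use.

```python
import locale

# Names without an encoding get a default one from the alias table;
# names that already include an encoding keep it, in normalized spelling.
for name in ["en_US", "de_DE", "en_US.UTF-8", "de_DE.ISO8859-15"]:
    print(name, "->", locale.normalize(name))
```

normalize() only consults the table. It does not call setlocale() and does not check whether the locale is installed.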

Regarding nl_langinfo():

nl_langinfo() will only work if you have called
setlocale() already, since a process always starts up in
the C locale without this call.

If you don't have a problem with calling setlocale() for
testing the default locale settings (e.g. Python is not
embedded, you don't have other threads running, no
APIs which use locale information called yet, setlocale()
was already called to set up the locale, etc.),
you can use the approach taken by getpreferredencoding(),
which is to temporarily set the locale to the default.
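
A minimal sketch of that approach, assuming a Unix-like platform where nl_langinfo() exists. The function name encoding_of_default_locale is made up here, and this is not the actual getpreferredencoding() implementation.

```python
import locale

def encoding_of_default_locale():
    """Return the codeset of the user's default locale.

    Temporarily switches LC_CTYPE to the environment's default and
    restores the previous setting afterwards. Not thread-safe, because
    setlocale() changes process-wide state.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query only
    try:
        try:
            locale.setlocale(locale.LC_CTYPE, "")  # use the environment
        except locale.Error:
            pass  # environment names an unavailable locale; keep current
        return locale.nl_langinfo(locale.CODESET)
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)

print(encoding_of_default_locale())
```

The caveats listed above (embedding, other threads) apply to the window between the two setlocale() calls.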

Going forward, I think that the following changes make
sense:

* from ISO8859-1 to ISO8859-15 (the -15 version adds
  the Euro sign)

* casing changes e.g. 'zh_CN.gb2312' to 'zh_CN.GB2312'

* fixes which undo removal of modifiers, such as changing the
  mapping of 'uz_uz@cyrillic' from 'uz_UZ.UTF-8' to 'uz_UZ.UTF-8@cyrillic'

As for the other changes: please undo them and also
revert the unconditional use of glibc mappings overriding
the X.org ones, as mentioned earlier in the thread.

We can re-add some of the modifications later on if there's
evidence that they actually do make sense.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

History
Date	User	Action	Args
2017-03-10 15:29:14	lemburg	set	recipients: + lemburg, loewis, benjamin.peterson, Arfrever, serhiy.storchaka
2017-03-10 15:29:14	lemburg	link	issue20087 messages
2017-03-10 15:29:13	lemburg	create