Title: locale.normalize() and getdefaultlocale() convert C.UTF-8 to en_US.UTF-8
Type:
Stage:
Components:
Versions: Python 3.8, Python 3.7, Python 3.6, Python 3.4, Python 3.5
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Jeffrey.Kintscher, benjamin.peterson, gordonmessmer, hroncok, lemburg, mattheww, ncoghlan, serhiy.storchaka, vstinner
Priority: normal
Keywords:

Created on 2017-06-25 16:58 by mattheww, last changed 2019-07-11 20:17 by Jeffrey.Kintscher.

Messages (7)
msg296828 - Author: Matthew Woodcraft (mattheww) Date: 2017-06-25 16:58
I have a system where the default locale is C.UTF-8, and en_US.UTF-8 is
not installed.

But locale.normalize() unhelpfully converts "C.UTF-8" to "en_US.UTF-8".

So the following crashes for me:

  python3.6 -c "import locale;locale.setlocale(locale.LC_ALL, ('C', 'UTF-8'))"

Similarly getdefaultlocale() returns ('en_US', 'UTF-8'), so this crashes too:

  export LANG=C.UTF-8
  unset LC_CTYPE
  unset LC_ALL
  unset LANGUAGE
  python3.6 -c "import locale;locale.setlocale(locale.LC_ALL, locale.getdefaultlocale())"

This behaviour is caused by a locale_alias entry in Lib/ . documents its addition but doesn't
provide a rationale.

I can see that it might be helpful to provide such a conversion if
C.UTF-8 doesn't exist and en_US.UTF-8 does, but the current code is
breaking modern correctly-configured systems for the benefit of old
misconfigured ones (C.UTF-8 shouldn't really be in the environment if it
isn't available on the system, after all).
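
One way to see the rewrite described above on a given interpreter is to compare a locale name with what locale.normalize() returns for it. The following is a minimal sketch, not part of the original report: the helper name is_rewritten is hypothetical, and the result for "C.UTF-8" depends on the alias table shipped with the Python version in use.

```python
import locale


def is_rewritten(name):
    """Return True if locale.normalize() changes the language part of `name`."""
    normalized = locale.normalize(name)
    # Compare only the part before the encoding, e.g. "C" versus "en_US".
    return normalized.split(".")[0] != name.split(".")[0]


for name in ("C", "C.UTF-8", "en_US.UTF-8"):
    print(name, "->", locale.normalize(name), "| rewritten:", is_rewritten(name))
```

On an affected interpreter, only the "C.UTF-8" line should report a rewrite.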
msg297342 - Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-06-30 02:20
I'm honestly not sure how our Python-level locale handling really works (I've mainly worked on the lower-level C locale manipulation), so I'm adding folks to the nosy list based on #20076 and #29571.

I agree we shouldn't be aliasing C.UTF-8 to en_US.UTF-8 though - we took en_US.UTF-8 out of the locale coercion fallback list in PEP 538 because it wasn't really right.
msg302981 - Author: Matthew Woodcraft (mattheww) Date: 2017-09-25 22:37
I've investigated a bit more.

First, I've tried with Python 3.7.0a1. As you'd expect, PEP 537 means
this behaviour now also occurs when no locale environment variables at
all are set.

Second, I've looked through a bit. I believe what it calls the
"aliasing engine" is applied for:

 - getlocale()
 - getdefaultlocale()
 - setlocale() when passed a tuple, but not when passed a string
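
The string-versus-tuple difference in the list above can be probed with a sketch like the following. It is an illustration under assumptions, not code from the report: the helper name attempt is hypothetical, and what each call returns depends on the Python version and on which locales are installed.

```python
import locale


def attempt(category, value):
    """Try setlocale(category, value); return the result or the error text."""
    saved = locale.setlocale(category)  # string query, bypasses aliasing
    try:
        return locale.setlocale(category, value)
    except locale.Error as exc:
        return "locale.Error: {}".format(exc)
    finally:
        locale.setlocale(category, saved)  # always restore the original


# The string is handed to the C library unchanged; the tuple is first
# turned into a name and passed through locale.normalize().
print("string:", attempt(locale.LC_COLLATE, "C.UTF-8"))
print("tuple: ", attempt(locale.LC_COLLATE, ("C", "UTF-8")))
```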

This leads to some rather odd results.

With 3.7.0a1 and no locale environment variables:

  >>> import locale
  >>> locale.getlocale()
  ('en_US', 'UTF-8')

  # getlocale() is lying: the effective locale is really C.UTF-8
  >>> sorted("abcABC", key=locale.strxfrm)
  ['A', 'B', 'C', 'a', 'b', 'c']

Third, I've checked on a system which does have en_US.UTF-8 installed,
and (as you'd expect) instead of crashing it gives wrong results:

  >>> import locale
  >>> locale.setlocale(locale.LC_ALL, ('C', 'UTF-8'))
  >>> locale.getlocale()
  ('en_US', 'UTF-8')

  # now getlocale() is telling the truth, and the user isn't getting the
  # collation they requested
  >>> sorted("abcABC", key=locale.strxfrm)
  ['a', 'A', 'b', 'B', 'c', 'C']
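
Since the name reported by getlocale() may not match the locale in effect, the sorting check used in the sessions above can be wrapped in a small helper that looks at behaviour instead of names. This is a minimal sketch; the name collation_order is hypothetical, and the demonstration uses only the plain "C" locale, which is assumed to be available everywhere.

```python
import locale


def collation_order(sample="abcABC"):
    """Sort `sample` with the collation rules that are actually in effect."""
    return sorted(sample, key=locale.strxfrm)


saved = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
try:
    locale.setlocale(locale.LC_COLLATE, "C")
    c_order = collation_order()  # code-point order under the "C" locale
finally:
    locale.setlocale(locale.LC_COLLATE, saved)

print(c_order)
```

Comparing collation_order() against this baseline shows whether the active collation is C-style or language-specific, whatever getlocale() claims.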
msg302982 - Author: Matthew Woodcraft (mattheww) Date: 2017-09-25 22:39
(For PEP 537 please read PEP 538, sorry)
msg347520 - Author: Gordon Messmer (gordonmessmer) * Date: 2019-07-09 05:44
> I can see that it might be helpful to provide such a conversion if
> C.UTF-8 doesn't exist and en_US.UTF-8 does

That can't happen.  The "C" locale describes the behavior defined in the ISO C standard.  It's built into glibc (and should be for all other libc implementations).  All other locales require external support (i.e. /usr/lib/locale/<locale>).
msg347521 - Author: Gordon Messmer (gordonmessmer) * Date: 2019-07-09 06:10
> I agree we shouldn't be aliasing C.UTF-8 to en_US.UTF-8 though

What can we do about reverting that change?  Python's current behavior causes unexpected exceptions, especially in containers.

I'm currently debugging test failures in a Python application that occur in Fedora rawhide containers.  Those containers don't have any locales installed.  The test software saves its current locale, changes the locale in order to run a test, and then restores the original.  Because Python is incorrectly reporting the original locale as "en_US", restoring the original fails.
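
A save-and-restore pattern that avoids the aliasing can be sketched as below. This is an assumption about how such test code could be written, not the application's actual code: the name temporary_locale is hypothetical. It relies on setlocale() called without a new value returning the current setting as a string, and on strings being passed back to the C library without going through locale.normalize().

```python
import contextlib
import locale


@contextlib.contextmanager
def temporary_locale(category, name):
    """Switch `category` to `name` for the duration of a with-block."""
    # Query with setlocale(category) rather than getlocale(): the query
    # returns the string the C library reports, with no alias rewriting.
    saved = locale.setlocale(category)
    try:
        yield locale.setlocale(category, name)
    finally:
        locale.setlocale(category, saved)  # restore exactly what was saved


with temporary_locale(locale.LC_COLLATE, "C") as active:
    print("inside:", active)
print("restored:", locale.setlocale(locale.LC_COLLATE))
```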
msg347528 - Author: Miro Hrončok (hroncok) * Date: 2019-07-09 08:42
>> C.UTF-8 doesn't exist and en_US.UTF-8 does
> That can't happen

It certainly can. Take for example RHEL 7 or 6.
History
Date                 User               Action  Args
2019-07-11 20:17:50  Jeffrey.Kintscher  set     nosy: + Jeffrey.Kintscher
2019-07-09 08:42:49  hroncok            set     nosy: + vstinner, hroncok
                                                messages: + msg347528
                                                versions: + Python 3.8
2019-07-09 06:10:15  gordonmessmer      set     messages: + msg347521
2019-07-09 05:44:38  gordonmessmer      set     nosy: + gordonmessmer
                                                messages: + msg347520
2017-09-25 22:39:15  mattheww           set     messages: + msg302982
2017-09-25 22:37:16  mattheww           set     messages: + msg302981
                                                versions: + Python 3.7
2017-06-30 02:20:13  ncoghlan           set     nosy: + lemburg, benjamin.peterson, serhiy.storchaka
                                                messages: + msg297342
2017-06-29 18:35:14  r.david.murray     set     nosy: + ncoghlan
2017-06-25 16:58:59  mattheww           create