classification
Title: Python 3 raises Unicode errors with the xxx.UTF-8 locale
Type: behavior
Stage:
Components: Unicode
Versions: Python 3.4

process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, lemburg, loewis, ncoghlan, r.david.murray, rsc1975, serhiy.storchaka, vstinner
Priority: normal
Keywords:

Created on 2015-08-31 08:42 by rsc1975, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (7)
msg249390 - (view) Author: Roberto Sánchez (rsc1975) Date: 2015-08-31 08:42
System: Python 3.4.2 on Linux Fedora 22

This issue is closely related to http://bugs.python.org/issue19846, but it isn't exactly the same case.

When I connect over ssh from my Mac (OS X, using Terminal.app) to a Fedora Linux host, the terminal session is forced to the OS X locale (the default behavior in Terminal.app):

    [rob@fedora22 ~]$ locale
    locale: Cannot set LC_CTYPE to default locale: No such file or directory
    locale: Cannot set LC_MESSAGES to default locale: No such file or directory
    locale: Cannot set LC_ALL to default locale: No such file or directory
    LANG=es_ES.UTF-8
    LC_CTYPE="es_ES.UTF-8"
    LC_NUMERIC="es_ES.UTF-8"
    LC_TIME="es_ES.UTF-8"
    LC_COLLATE="es_ES.UTF-8"
    LC_MONETARY="es_ES.UTF-8"
    LC_MESSAGES="es_ES.UTF-8"
    LC_PAPER="es_ES.UTF-8"
    LC_NAME="es_ES.UTF-8"
    LC_ADDRESS="es_ES.UTF-8"
    LC_TELEPHONE="es_ES.UTF-8"
    LC_MEASUREMENT="es_ES.UTF-8"
    LC_IDENTIFICATION="es_ES.UTF-8"
    LC_ALL=

However, the locales installed on the Fedora host are:

    [rob@fedora22 ~]$ localectl list-locales
    en_US
    en_US.iso88591
    en_US.iso885915
    en_US.utf8       <-- This is the default one

And if I launch python3 I get:

    [rob@fedora22 ~]$ python3
    Python 3.4.2 (default, Jul  9 2015, 17:24:30) 
    [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os, codecs, sys, locale
    >>> locale.getpreferredencoding()
    'ANSI_X3.4-1968'
    >>> codecs.lookup(locale.getpreferredencoding()).name
    'ascii'
    >>> locale.getdefaultlocale()
    ('es_ES', 'UTF-8')
    >>> sys.stdout.encoding
    'ANSI_X3.4-1968'
    >>> sys.getfilesystemencoding()
    'ascii'
    >>> print('España')
      File "<stdin>", line 0
        
        ^
    SyntaxError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)


So, if I understand correctly, when the current locale is not supported by the system, Python falls back to ASCII.

I can understand this behavior when the supported locales and the current one use different encodings, but if all of them are UTF-8, it seems reasonable for locale.getpreferredencoding() to return 'utf-8'.
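The mismatch described here can be detected from inside a script. The following is only a minimal sketch, not anything CPython does: the helper names `declared_encoding`, `effective_encoding`, and `locale_mismatch` are hypothetical, and the parsing assumes the usual `language_TERRITORY.codeset@modifier` locale name layout.

```python
import codecs
import locale


def declared_encoding(env):
    """Return the codec name implied by LC_ALL, LC_CTYPE or LANG, or None."""
    # POSIX precedence for character handling: LC_ALL, then LC_CTYPE, then LANG.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if not value:
            continue  # unset or empty: fall through to the next variable
        # Assumed layout: language_TERRITORY.codeset@modifier
        _, dot, codeset = value.partition(".")
        codeset = codeset.split("@", 1)[0]
        if not dot or not codeset:
            return None  # e.g. "C" or "POSIX": no codeset is named
        try:
            return codecs.lookup(codeset).name
        except LookupError:
            return None  # codeset name that Python has no codec for
    return None


def effective_encoding():
    """Return the normalized name of the encoding Python actually selected."""
    return codecs.lookup(locale.getpreferredencoding(False)).name


def locale_mismatch(env):
    """True when the environment names one encoding but Python uses another."""
    declared = declared_encoding(env)
    return declared is not None and declared != effective_encoding()
```

A script could call `locale_mismatch(os.environ)` at startup and print a warning. In the session above, the environment declares UTF-8 while Python reports ASCII, which is the situation this check is meant to flag.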

This case makes programs with a CLI (command-line interface) fail. If you use a third-party library such as click, the library itself raises a RuntimeError. I learned the hard way that Python 3 CLI programs need a valid encoding to deal with stdin/stdout. In this case every system seems correctly configured as far as the encoding goes: this is a real case, with no manual changes to the locale configuration. IMHO the current behavior seems a bit strict.
msg249399 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-08-31 11:38
It's not a bug in Python, but a bug on your system.

> New submission from Roberto Sánchez:
>     [rob@fedora22 ~]$ locale
>     locale: Cannot set LC_CTYPE to default locale: No such file or directory

This message means that the chosen locale doesn't exist.

>     LANG=es_ES.UTF-8
...
>     [rob@fedora22 ~]$ localectl list-locales
> ....
>     en_US.utf8       <-- This is the default one

LANG must be en_US.utf8.
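Whether a given locale name exists on the host can also be checked from Python. This is a rough sketch, assuming that `locale.setlocale()` raises `locale.Error` for a locale the C library cannot load; the helper name `locale_is_available` is hypothetical.

```python
import locale


def locale_is_available(name):
    """Return True if the C library accepts `name` for LC_CTYPE."""
    # Calling setlocale() with only a category queries the current setting.
    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        return False
    else:
        return True
    finally:
        # Restore the previous setting so the check has no lasting effect.
        locale.setlocale(locale.LC_CTYPE, saved)
```

On the Fedora host from the report, one would expect `locale_is_available("es_ES.UTF-8")` to be false and `locale_is_available("en_US.utf8")` to be true, matching the `localectl list-locales` output. Note that `setlocale()` changes process-wide state, so this check is not safe to run concurrently with other threads that depend on the locale.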
msg249400 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-08-31 13:02
CPython inherits this behaviour from glibc's locale handling, so it's potentially worth raising the question further upstream. If anyone wanted to pursue that, looking at http://www.gnu.org/software/libc/development.html suggests to me that the appropriate starting point would be to email libc-help@sourceware.org and ask for advice.
msg249401 - (view) Author: Roberto Sánchez (rsc1975) Date: 2015-08-31 13:03
OK, I already knew that "it is not a bug", but the scenario seems quite common: connecting to a Linux host from a Mac with Terminal.app and different locales (the default behavior). A bit of "magic" when the encoding part of the locale is correct would help to deal with some Unicode issues in python3 scripts.

I'm just saying it would be a desirable enhancement, but I have no idea how complex it would be to change the current behavior; maybe it isn't worth the effort.
msg249404 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-08-31 15:28
I believe there is at least one open issue about Python adopting UTF-8 as the default instead of ASCII, and in any case there have been several conversations about how to deal with all this better. This is just one example of a class of issues caused by the ASCII-only C/POSIX default locale, in different contexts.
msg249441 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-09-01 00:05
Looking again at the *specific* bug report here, I'm moving the resolution to "out of date", as it's actually the one we addressed in 3.5 by enabling surrogateescape by default on all of the standard streams when the OS claims the locale encoding is ASCII, not just stderr: http://bugs.python.org/issue19977

That allows us to at least correctly roundtrip data, even if the OS has given us bad encoding settings.
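As an illustration of what "roundtrip" means here, the snippet below applies the surrogateescape error handler by hand to UTF-8 bytes while the nominal encoding is ASCII. It is a standalone sketch of the handler's behavior, not the code path the interpreter uses for the standard streams.

```python
raw = "España\n".encode("utf-8")  # bytes as a UTF-8 terminal would send them

# Strict ASCII decoding fails on the first non-ASCII byte.
try:
    raw.decode("ascii")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate
# in the range U+DC80..U+DCFF instead of raising.
text = raw.decode("ascii", errors="surrogateescape")

# Encoding with the same handler turns those surrogates back into
# the original bytes, so nothing is lost on the way out.
restored = text.encode("ascii", errors="surrogateescape")
```

The intermediate `text` is not a correct decoding of "España" (the ñ appears as two lone surrogates), but the bytes written back out are identical to the bytes read in.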

The problem with forcing UTF-8 more generally when the OS claims ASCII is that it may be the wrong thing to do and result in data corruption, especially on systems using East Asian codecs. Querying /etc/locale.conf [1] instead of relying on the nominal glibc locale settings should reliably give us correct encoding/locale information on modern Linux systems in cases like this one, where SSH has forwarded mismatched locale settings from a client system to a server shell session.
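To make the /etc/locale.conf idea concrete, here is a small sketch of reading such a file. It assumes the format described in the systemd documentation cited as [1] (one `NAME=value` assignment per line, `#` comments, optional quoting); the function names `parse_locale_conf` and `system_codeset` are hypothetical, and nothing here reflects an actual CPython implementation.

```python
def parse_locale_conf(text):
    """Parse locale.conf-style content into a dict of variable assignments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment; ignore it
        value = value.strip()
        # Values may be quoted shell-style; strip one matching pair of quotes.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        settings[key.strip()] = value
    return settings


def system_codeset(settings):
    """Return the codeset part of LANG (for example 'utf8'), or None."""
    lang = settings.get("LANG", "")
    _, dot, codeset = lang.partition(".")
    codeset = codeset.split("@", 1)[0]
    return codeset if dot and codeset else None
```

A caller would read the file itself, for example `parse_locale_conf(open("/etc/locale.conf").read())`, and should be prepared for the file to be missing on systems that do not use systemd.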

Another issue with relevant background discussion is issue #23993, which speculated on extending the "default to surrogateescape" idea to all open() calls when glibc claims the locale encoding is ASCII.

[1] http://www.freedesktop.org/software/systemd/man/locale.conf.html
msg249464 - (view) Author: Roberto Sánchez (rsc1975) Date: 2015-09-01 07:47
OK, that makes sense. Besides, David pointed me to another open issue that could help with cases like this: http://bugs.python.org/issue15216 If the encoding is wrong because of the environment but we can easily change the initial stream encodings (for stdin/stdout), we have a powerful tool to adapt our scripts and work around broken locales like the ones produced by SSH sessions.
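One way a script can override the stream encoding itself is to wrap the underlying binary buffer in a new text layer. The sketch below assumes the stream is an `io.TextIOWrapper` (as `sys.stdout` normally is); the helper name `rewrap_utf8` is hypothetical.

```python
import io


def rewrap_utf8(stream):
    """Return a new text wrapper over stream's binary buffer that uses UTF-8."""
    stream.flush()  # push out text already buffered under the old encoding
    # The old wrapper must stay alive, otherwise its finalizer closes the
    # shared buffer. For the standard streams, sys.__stdout__ and friends
    # keep a reference to the original objects.
    return io.TextIOWrapper(
        stream.buffer,
        encoding="utf-8",
        errors="strict",
        line_buffering=stream.line_buffering,
    )
```

Typical use would be `sys.stdout = rewrap_utf8(sys.stdout)` near the top of a script. Two alternatives that avoid the helper: setting the `PYTHONIOENCODING=utf-8` environment variable before starting the interpreter, and, on Python 3.7 and newer, calling `sys.stdout.reconfigure(encoding="utf-8")`. All of these assume the terminal really does expect UTF-8, which is exactly the assumption the discussion above warns may be wrong on some systems.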
History
Date                 User              Action  Args
2022-04-11 14:58:20  admin             set     github: 69156
2015-09-01 07:47:23  rsc1975           set     messages: + msg249464
2015-09-01 00:05:15  ncoghlan          set     resolution: not a bug -> out of date; messages: + msg249441
2015-08-31 15:28:38  r.david.murray    set     nosy: + r.david.murray; messages: + msg249404
2015-08-31 13:03:13  rsc1975           set     messages: + msg249401
2015-08-31 13:02:23  ncoghlan          set     messages: + msg249400
2015-08-31 11:38:43  vstinner          set     status: open -> closed; resolution: not a bug; messages: + msg249399
2015-08-31 09:02:46  serhiy.storchaka  set     nosy: + lemburg, loewis, ncoghlan, serhiy.storchaka
2015-08-31 08:42:12  rsc1975           create