
Python 3 raises Unicode errors with the C locale #64045

Closed
Sworddragon mannequin opened this issue Nov 30, 2013 · 68 comments
Labels
topic-IO type-bug An unexpected behavior, bug, or error

Comments

@Sworddragon
Mannequin

Sworddragon mannequin commented Nov 30, 2013

BPO 19846
Nosy @malemburg, @loewis, @terryjreedy, @ncoghlan, @pitrou, @vstinner, @larryhastings, @jwilk, @abadger, @bitdancer, @serhiy-storchaka
Files
  • test.py: Example script
  • asciilocale.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2013-12-09.10:42:16.755>
    created_at = <Date 2013-11-30.21:40:45.188>
    labels = ['type-bug', 'invalid', 'expert-IO']
    title = 'Python 3 raises Unicode errors with the C locale'
    updated_at = <Date 2017-12-18.14:38:09.078>
    user = 'https://bugs.python.org/Sworddragon'

    bugs.python.org fields:

    activity = <Date 2017-12-18.14:38:09.078>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2013-12-09.10:42:16.755>
    closer = 'vstinner'
    components = ['IO']
    creation = <Date 2013-11-30.21:40:45.188>
    creator = 'Sworddragon'
    dependencies = []
    files = ['32914', '33026']
    hgrepos = []
    issue_num = 19846
    keywords = ['patch']
    message_count = 68.0
    messages = ['204849', '204850', '204852', '205418', '205419', '205454', '205459', '205462', '205465', '205472', '205497', '205498', '205505', '205538', '205545', '205547', '205548', '205549', '205550', '205554', '205555', '205564', '205611', '205615', '205623', '205625', '205637', '205640', '205642', '205646', '205654', '205655', '205669', '205670', '205671', '205672', '205673', '205675', '205688', '205690', '205691', '205693', '205694', '205727', '205747', '205748', '205749', '205751', '205772', '205783', '205848', '205855', '205859', '205871', '206055', '206065', '206068', '206071', '206098', '206101', '206107', '206109', '206112', '206116', '206169', '232290', '283717', '308567']
    nosy_count = 14.0
    nosy_names = ['lemburg', 'loewis', 'terry.reedy', 'ncoghlan', 'pitrou', 'vstinner', 'larry', 'jwilk', 'a.badger', 'r.david.murray', 'Sworddragon', 'serhiy.storchaka', 'bkabrda', 'editor-buzzfeed']
    pr_nums = []
    priority = 'normal'
    resolution = 'not a bug'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue19846'
    versions = ['Python 3.3', 'Python 3.4']

    @Sworddragon
    Mannequin Author

    Sworddragon mannequin commented Nov 30, 2013

    It seems that print() and write() (and maybe other such I/O functions) are relying on sys.getfilesystemencoding(). But these functions are not operating on filenames but on their content. Attached is an example script which demonstrates this problem. Here is what I get:

    sworddragon@ubuntu:~/tmp$ echo $LANG
    de_DE.UTF-8
    sworddragon@ubuntu:~/tmp$ python3 test.py
    sys.getdefaultencoding(): utf-8
    sys.getfilesystemencoding(): utf-8
    ä
    sworddragon@ubuntu:~/tmp$ LANG=C
    sworddragon@ubuntu:~/tmp$ python3 test.py
    sys.getdefaultencoding(): utf-8
    sys.getfilesystemencoding(): ascii
    Traceback (most recent call last):
      File "test.py", line 4, in <module>
        print('\xe4')
    UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 0: ordinal not in range(128)
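
    (The attached test.py did not survive the migration; based on the output and traceback above it was presumably close to the following sketch, though the exact line numbers may differ.)

        import sys

        # Report the two encodings being discussed, then print a non-ASCII character.
        print('sys.getdefaultencoding():', sys.getdefaultencoding())
        print('sys.getfilesystemencoding():', sys.getfilesystemencoding())
        print('\xe4')  # U+00E4 "ä" - fails with a strict ascii stdout under LANG=C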

    @Sworddragon Sworddragon mannequin added topic-IO type-bug An unexpected behavior, bug, or error labels Nov 30, 2013
    @bitdancer
    Member

    Victor can correct me if I'm wrong, but I believe that stdin/stdout/stderr all use the filesystem encoding because filenames are the most likely source of non-ascii characters on those streams. (Not a perfect solution, but the best we can do.)
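
    (This is easy to confirm from a shell; under the C locale both report the same ASCII codec, just under different alias names, matching the values shown later in this thread:)

        $ LANG=C python3 -c "import sys; print(sys.stdout.encoding, sys.getfilesystemencoding())"
        ANSI_X3.4-1968 ascii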

    @vstinner
    Member

    "Filesystem encoding" is not a good name. You should read "OS encoding" or
    maybe "locale encoding".

    This encoding is the best choice for interoperability with other (python2 or
    non-python) programs. If you don't care about interoperability, force the
    encoding using the PYTHONIOENCODING environment variable.

    @terryjreedy
    Member

    Unless there is an actual possibility of changing this, which I doubt since it is a choice and not a bug, and changing might break things, this issue should be closed.

    @pitrou
    Member

    pitrou commented Dec 7, 2013

    I think the ship has sailed on this. We can't change our heuristic every time someone finds a flaw in the current one.

    In the long term, all sensible UNIX systems should be configured for utf-8 filenames and contents, so it won't make a difference anymore.

    @vstinner
    Member

    vstinner commented Dec 7, 2013

    If you want to avoid the encoding errors, you can also use PYTHONIOENCODING=:replace or PYTHONIOENCODING=:backslashreplace in Python 3.4 to keep the locale encoding but use an error handler other than strict.
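
    (A minimal illustration of that suggestion, reusing the hypothetical test.py sketched above; the leading colon means "keep the locale's encoding and only change the error handler", which requires Python 3.4+.)

        # Keep the locale encoding but relax the error handler (Python 3.4+):
        $ LANG=C PYTHONIOENCODING=:backslashreplace python3 test.py
        # Or force both an encoding and an error handler explicitly:
        $ LANG=C PYTHONIOENCODING=utf-8:replace python3 test.py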

    @Sworddragon
    Mannequin Author

    Sworddragon mannequin commented Dec 7, 2013

    Using an environment variable is not the holy grail for this. When writing a non-single-user application you can't expect the user to set extra environment variables.

    If compatibility is the only reason, in my opinion it would be much better to include something like sys.use_strict_encoding(), which decides whether print()/write() will use sys.getfilesystemencoding() or sys.getdefaultencoding().

    @pitrou
    Member

    pitrou commented Dec 7, 2013

    Using an environment variable is not the holy grail for this. When
    writing a non-single-user application you can't expect the user to set
    extra environment variables.

    I don't understand why the user would have to set anything at all.
    What is the use case for per-user encoding settings?

    I understand that passing LANG=C (e.g. to disable a program's
    translations) forces ASCII instead of UTF-8, which is a flaw. Perhaps
    the filesystem encoding should be set to UTF-8 when the system locale
    says ASCII.

    (OTOH, it's IMHO a system bug that LANG=C forces the ASCII charset;
    we're not in the 80s anymore)

    @ncoghlan
    Contributor

    ncoghlan commented Dec 7, 2013

    Antoine's suggestion of being a little more aggressive in choosing utf-8 over ascii as the OS API encoding sounds reasonable to me.

    I think we're getting to a point where a system claiming ASCII as the encoding to use is almost certainly a misconfiguration rather than a desired setting. If someone *really* means ASCII, they can force it for at least the std streams with PYTHONIOENCODING.

    @pitrou
    Member

    pitrou commented Dec 7, 2013

    Here is a patch.

    $ LANG=C ./python -c "import os, sys, locale; print(sys.getfilesystemencoding(), sys.stdin.encoding, os.device_encoding(0), locale.getpreferredencoding())"

    -> Without the patch:
    ascii ANSI_X3.4-1968 ANSI_X3.4-1968 ANSI_X3.4-1968

    -> With the patch:
    utf-8 utf-8 utf-8 ANSI_X3.4-1968

    @vstinner
    Member

    vstinner commented Dec 7, 2013

    There was a previous attempt to use a filesystem encoding different from the locale encoding, and it introduced too many issues:
    https://mail.python.org/pipermail/python-dev/2010-October/104509.html
    "Inconsistencies if locale and filesystem encodings are different"

    Python uses the fact that the filesystem encoding is the locale encoding in various places. For example, Python uses the C codec (mbstowcs) to decode byte strings from the filesystem encoding before Python codecs can be used. For example, the ISO 8859-15 codec is implemented in Python and so you need something during Python startup until the import machinery is ready and the codec is loaded (using the ascii encoding is not correct).

    The C locale may use a different encoding. For example on AIX, the ISO 8859-1 encoding is used. On FreeBSD and Solaris, the ISO 8859-1 encoding is announced but the ASCII encoding is used in practice. Python forces the ascii encoding on FreeBSD to avoid other issues.

    I worked hard to have Python 3 working out of the box on all platforms. In my opinion, going against the locale encoding in some cases (the C locale) would introduce more issues than it solves.

    @pitrou
    Member

    pitrou commented Dec 7, 2013

    Python uses the fact that the filesystem encoding is the locale
    encoding in various places.

    The patch doesn't change that.

    @ncoghlan
    Contributor

    ncoghlan commented Dec 8, 2013

    Note that the *only* change Antoine's patch makes is that:

    • *if* the locale encoding is ASCII (or an alias for ASCII)
    • *then* Python sets the filesystem encoding to UTF-8 instead

    If the locale encoding is anything *other* than ASCII, then that will still be used as the filesystem encoding, so environments that use something other than ASCII for the C locale will retain their current behaviour.

    The rationale for this approach is based on the assumption that the *most likely* way to get a locale encoding of ASCII at this point in time is to use "LANG=C" on a system where the locale encoding is normally something more suited to a Unicode world (likely UTF-8).

    Will assuming utf-8 sometimes cause problems? Quite possibly. But assuming that the platform's claim to only support ASCII is correct causes serious usability problems, too.
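
    (In rough Python terms the decision the patch makes at startup amounts to the sketch below; the real change lives in CPython's C initialization code, so this is only an illustration, not the actual implementation.)

        import codecs
        import locale

        def choose_filesystem_encoding():
            claimed = locale.nl_langinfo(locale.CODESET) or 'ascii'  # what the locale reports
            if codecs.lookup(claimed).name == 'ascii':               # ASCII or any alias of it
                return 'utf-8'                                       # assume the locale is wrong
            return claimed                                           # otherwise trust the locale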

    @vstinner
    Member

    vstinner commented Dec 8, 2013

    Antoine Pitrou added the comment:

    > Python uses the fact that the filesystem encoding is the locale
    > encoding in various places.
    The patch doesn't change that.

    Nick Coghlan added the comment:

    Note that the *only* change Antoine's patch makes is that:

    • *if* the locale encoding is ASCII (or an alias for ASCII)
    • *then* Python sets the filesystem encoding to UTF-8 instead

    If the locale encoding is ASCII, the filesystem encoding (UTF-8) is
    different than the locale encoding.

    @ncoghlan
    Contributor

    ncoghlan commented Dec 8, 2013

    Yes, that's the point. *Every* case I've seen where the locale encoding has been reported as ASCII on a modern Linux system has been because the environment has been configured to use the C locale, and that locale has a silly, antiquated encoding setting.

    This is particularly problematic when people remotely access a system with ssh and get given the C locale instead of something sensible, and then can't properly read the filesystem on that server.

    The idea of using UTF-8 instead in that case is to *change* (and hopefully reduce) the number of cases where things go wrong.

    • if no non-ASCII data is encountered, the choice of ASCII vs UTF-8 doesn't matter
    • if it's a modern Linux distro, then the real filesystem encoding is UTF-8, and the setting it provides for LANG=C is just plain *wrong*
    • there may be other cases where ASCII actually *is* the filesystem encoding (in which case they're going to have trouble anyway), or the real filesystem encoding is something other than UTF-8

    We're already approximating things on Linux by assuming every filesystem is using the *same* encoding, when that's not necessarily the case. Glib applications also assume UTF-8, regardless of the locale (http://unix.stackexchange.com/questions/2089/what-charset-encoding-is-used-for-filenames-and-paths-on-linux).

    At the moment, setting "LANG=C" on a Linux system *fundamentally breaks Python 3*, and that's not OK.

    @ncoghlan ncoghlan changed the title print() and write() are relying on sys.getfilesystemencoding() instead of sys.getdefaultencoding() Setting LANG=C breaks Python 3 Dec 8, 2013
    @vstinner
    Member

    vstinner commented Dec 8, 2013

    2013/12/8 Nick Coghlan <report@bugs.python.org>:

    Yes, that's the point. *Every* case I've seen where the locale encoding has been reported as ASCII on a modern Linux system has been because the environment has been configured to use the C locale, and that locale has a silly, antiquated, encoding setting.

    This is particularly problematic when people remotely access a system with ssh and get given the C locale instead of something sensible, and then can't properly read the filesystem on that server.

    The solution is to fix the locale, not to fix Python. For example,
    don't set LANG to C.

    From the C locale, you cannot guess the "correct" encoding. In
    Unicode, the general rule is to never try to guess the encoding.

    The idea of using UTF-8 instead in that case is to *change* (and hopefully reduce) the number of cases where things go wrong.

    If the OS uses ISO-8859-1, forcing the Python (filesystem) encoding to
    UTF-8 would produce invalid filenames, display mojibake and more
    generally produce data incompatible with other applications (which rely
    on the C locale, and so on the ASCII encoding).

    • there may be other cases where ASCII actually *is* the filesystem encoding (in which case they're going to have trouble anyway), or the real filesystem encoding is something other than UTF-8

    As I wrote before, sys.getfilesystemencoding() is *not* the filesystem
    encoding. It's the "OS" encoding used to decode any kind of data
    coming from the OS and to encode Python data back to the OS. Just
    some examples:

    • DNS hostnames
    • Environment variables
    • Command line arguments
    • Filenames
    • user/group entries in the grp/pwd modules
    • almost all functions of the os module; they return various types of
      information (ttyname, ctermid, current working directory, login, ...)

    We're already approximating things on Linux by assuming every filesystem is using the *same* encoding, when that's not necessarily the case. Glib applications also assume UTF-8, regardless of the locale (http://unix.stackexchange.com/questions/2089/what-charset-encoding-is-used-for-filenames-and-paths-on-linux).

    If you use a different encoding just for filenames, you will
    get mojibake when you pass a filename on the command line or in an
    environment variable.

    At the moment, setting "LANG=C" on a Linux system *fundamentally breaks Python 3*, and that's not OK.

    Getting an ASCII filesystem encoding is annoying, but I would not say
    that it fundamentally breaks Python 3. If you want to do something,
    you should write documentation explaining how to configure Linux
    properly.

    @vstinner vstinner changed the title Setting LANG=C breaks Python 3 print() and write() are relying on sys.getfilesystemencoding() instead of sys.getdefaultencoding() Dec 8, 2013
    @pitrou
    Member

    pitrou commented Dec 8, 2013

    If you use a different encoding just for filenames, you will
    get mojibake when you pass a filename on the command line or in an
    environment variable.

    That's not what the patch does.

    @vstinner
    Member

    vstinner commented Dec 8, 2013

    2013/12/8 Antoine Pitrou <report@bugs.python.org>:

    > Python uses the fact that the filesystem encoding is the locale
    > encoding in various places.

    The patch doesn't change that.

    You wrote: "-> With the patch: utf-8 utf-8 utf-8 ANSI_X3.4-1968", so
    sys.getfilesystemencoding() != locale.getpreferredencoding().
    Or said differently, the filesystem encoding is different than the
    locale encoding.

    So please read again my message linked below, which lists real bugs:
    https://mail.python.org/pipermail/python-dev/2010-October/104509.html

    If you want to use a filesystem encoding different than the locale
    encoding, you have to patch Python where Python assumes that the
    filesystem encoding is the locale encoding, to fix all these bugs.
    Starts with:

    • PyUnicode_DecodeFSDefaultAndSize()
    • PyUnicode_EncodeFSDefault()
    • _Py_wchar2char()
    • _Py_char2wchar()

    It should be easier to change these functions if FS != locale only
    occurs when the FS encoding is "UTF-8". On Mac OS X, Python always uses
    UTF-8 for the filesystem encoding; it doesn't care about the locale
    encoding. See _Py_DecodeUTF8_surrogateescape() in unicodeobject.c, you may reuse it.

    With a better patch, I can do more experiments to check if there are
    other tricky bugs.

    Does at least your patch pass the whole test suite with LANG=C?

    @serhiy-storchaka
    Member

    Setting the sys.stderr encoding to UTF-8 on an ASCII locale is wrong. sys.stderr has the backslashreplace error handler by default, so it never fails and should never produce non-ASCII data on an ASCII locale.
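
    (This is easy to check; under the C locale the stderr error handler is already backslashreplace, so non-ASCII text is escaped rather than raising. The one-liner below typically prints "backslashreplace" followed by the escaped character:)

        $ LANG=C python3 -c "import sys; print(sys.stderr.errors, file=sys.stderr); print('\xe4', file=sys.stderr)"
        backslashreplace
        \xe4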

    @larryhastings
    Contributor

    Antoine: are you characterizing this as a "bug" rather than a "new feature"?

    I'd like to see more of a consensus before something like this gets checked in. Right now I see a variety of opinions.

    When I think "conservative approach" and "knows about system encoding stuff", I think of Martin. Martin, can I ask you to form an opinion about this?

    @pitrou
    Member

    pitrou commented Dec 8, 2013

    Or said differently, the filesystem encoding is different than the
    locale encoding.

    Indeed, but the FS encoding and the IO encoding are the same.
    "locale encoding" doesn't really matter here, as we are assuming that it's wrong.

    @ncoghlan
    Contributor

    ncoghlan commented Dec 8, 2013

    Victor, people set "LANG=C" for all sorts of reasons, and we have no
    control over how operating systems define that locale. The user
    perception is "Python 3 doesn't work properly when you ssh into
    systems", not "Gee, I wish operating systems defined the C locale more
    sensibly".

    If you can come up with a more sensible guess than UTF-8, great, but
    believing the nonsense claim of "ASCII" from the OS is a
    not-insignificant usability issue on Linux, because it hoses *all* the
    OS API interactions. Yes, theoretically, using UTF-8 can cause
    problems, *if* the following all occur:

    • the OS *claims* the OS encoding is ASCII (so Python uses UTF-8 instead)
    • the OS encoding is *actually* something other than UTF-8
    • the program encounters non-ASCII data and writes it out to disk

    For fear of doing the wrong thing in that incredibly rare scenario,
    you're leaving Python broken under the C locale on *every* modern
    Linux distro as soon as it encounters non-ASCII data in an OS
    interface.

    @vstinner
    Member

    vstinner commented Dec 8, 2013

    "haypo: title: Setting LANG=C breaks Python 3 -> print() and write() are relying on sys.getfilesystemencoding() instead of sys.getdefaultencoding()"

    Oh, I didn't want to change the title of the issue, it's a bug in Roundup when I reply by email :-/

    @vstinner vstinner changed the title print() and write() are relying on sys.getfilesystemencoding() instead of sys.getdefaultencoding() Setting LANG=C breaks Python 3 Dec 8, 2013
    @vstinner
    Member

    vstinner commented Dec 8, 2013

    > Or said differently, the filesystem encoding is different than the
    > locale encoding.

    Indeed, but the FS encoding and the IO encoding are the same.
    "locale encoding" doesn't really matter here, as we are assuming that
    it's wrong.

    Oh, I realized that the "FS encoding" term is not clear. When I wrote "FS encoding", I mean sys.getfilesystemencoding(), which is mbcs on Windows, UTF-8 on Mac OS X and (currently) the locale encoding on other platforms (UNIX, e.g. Linux/FreeBSD/Solaris/AIX).

    --

    IMO there are two different points in this issue:

    (a) which encoding should be used when the C locale is in effect: the encoding announced by the OS via nl_langinfo(CODESET) (the current choice), or an arbitrary, optimistic "utf-8" encoding?

    (b) for technical reasons, Python reuses the C codec during Python initialization to decode and encode OS data, and so currently Python *must* use the locale encoding for its "filesystem encoding"

    Before taking a position on point (a), I would like to see a patch fixing point (b). I'm not against fixing point (b). I'm just saying that it's not trivial, and obviously it must be fixed before the status of point (a) can change. I even gave clues for fixing point (b).

    --

    asciilocale.patch has many issues. Try to run the Python test suite using this patch to see what I mean. Examples of failures:

    ======================================================================
    FAIL: test_non_ascii (test.test_cmd_line.CmdLineTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/haypo/prog/python/default/Lib/test/test_cmd_line.py", line 140, in test_non_ascii
        assert_python_ok('-c', command)
      File "/home/haypo/prog/python/default/Lib/test/script_helper.py", line 69, in assert_python_ok
        return _assert_python(True, *args, **env_vars)
      File "/home/haypo/prog/python/default/Lib/test/script_helper.py", line 55, in _assert_python
        "stderr follows:\n%s" % (rc, err.decode('ascii', 'ignore')))
    AssertionError: Process return code is 1, stderr follows:
    Unable to decode the command from the command line:
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 12: surrogates not allowed

    ======================================================================
    FAIL: test_ioencoding_nonascii (test.test_sys.SysModuleTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/haypo/prog/python/default/Lib/test/test_sys.py", line 603, in test_ioencoding_nonascii
        self.assertEqual(out, os.fsencode(test.support.FS_NONASCII))
    AssertionError: b'' != b'\xc3\xa6'

    ======================================================================
    FAIL: test_nonascii (test.test_warnings.CEnvironmentVariableTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/haypo/prog/python/default/Lib/test/test_warnings.py", line 774, in test_nonascii
        "['ignore:Deprecaci\xf3nWarning']".encode('utf-8'))
    AssertionError: b"['ignore:Deprecaci\\udcc3\\udcb3nWarning']" != b"['ignore:Deprecaci\xc3\xb3nWarning']"

    ======================================================================
    FAIL: test_nonascii (test.test_warnings.PyEnvironmentVariableTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/haypo/prog/python/default/Lib/test/test_warnings.py", line 774, in test_nonascii
        "['ignore:Deprecaci\xf3nWarning']".encode('utf-8'))
    AssertionError: b"['ignore:Deprecaci\\udcc3\\udcb3nWarning']" != b"['ignore:Deprecaci\xc3\xb3nWarning']"

    test_warnings is probably bpo-9988, test_cmd_line failure is maybe bpo-9992.

    There may be other issues; the Python test suite only has a few tests for non-ASCII characters.

    --

    If anything is changed, I would prefer to have more than a few months of testing to make sure that it doesn't break anything. So I set the version field to Python 3.5.

    @pitrou
    Member

    pitrou commented Dec 9, 2013

    On Sun., 2013-12-08 at 22:22 +0000, STINNER Victor wrote:

    (b) for technical reasons, Python reuses the C codec during Python
    initialization to decode and encode OS data, and so currently Python
    *must* use the locale encoding for its "filesystem encoding"

    Ahhh! Well indeed that's a bummer :-)

    asciilocale.patch has many issues. Try to run the Python test suite
    using this patch to see what I mean.

    I'm assuming much of this is due to (b) (all those tests seem to spawn
    external processes).

    It seems there is more work to do to get this right, but I'm not
    terribly interested either. Feel free to take over.

    @larryhastings
    Contributor

    The fact that write() uses sys.getfilesystemencoding() is either
    a defect or a bad design (I leave the decision to you).

    I have good news for you. write() does not call sys.getfilesystemencoding(), because the encoding is set at the time the file is opened.

    But I'm still missing a reply to my suggestion. As I see it,
    there is no disadvantage in optionally giving the developer control.

    The programmer has all the control they need. They can open their own pipes using any encoding they like, and they can even reopen stdin/stdout with a different encoding if they wish.
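
    (A minimal sketch of the "reopen stdout yourself" option; the choice of UTF-8 here is illustrative, not something the thread prescribes.)

        import io
        import sys

        # Rebind sys.stdout to a wrapper with an explicit encoding, independent of the locale.
        sys.stdout.flush()
        sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8', line_buffering=True)
        print('\xe4')  # now encoded as UTF-8 even under LANG=C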

    @Sworddragon
    Mannequin Author

    Sworddragon mannequin commented Dec 9, 2013

    If the environment variable is not enough

    There is a big difference between environment variables and internal calls: Environment variables are user-space while builtin/library functions are developer-space.

    I have good news for you. write() does not call
    sys.getfilesystemencoding(), because the encoding is set at the time the file is opened.

    Thanks for the clarification. I wish somebody had told me that after this sentence in my first post: "It seems that print() and write() (and maybe other such I/O functions) are relying on sys.getfilesystemencoding()."

    In theory this already makes my ticket invalid. Well, now I wish print() would allow choosing the encoding like open() does too^^

    @vstinner
    Member

    vstinner commented Dec 9, 2013

    There is a big difference between environment variables and internal calls: Environment variables are user-space while builtin/library functions are developer-space.

    You can reopen sys.stdout with a different encoding and replace sys.stdout. I don't remember the exact recipe, it's tricky if you want portable code (you have to take care of newline).

    For example, I wrote:
    http://hg.python.org/cpython/file/ebe28dba4a78/Lib/test/regrtest.py#l895

    But you can avoid reopening the file using stdout.detach().

    In theory this makes already my ticket invalid. Well, but now I would wish print() would allow to choose the encoding like open() too^^

    Many options were already proposed. Another, less convenient way is to use sys.stdout.buffer.write("text".encode(encoding)) (you have to flush sys.stdout before, and flush the buffer after, to avoid inconsistencies between the TextIOWrapper and the BufferedWriter).
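
    (That pattern, spelled out as a sketch; the encoding name is an arbitrary example.)

        import sys

        encoding = 'utf-8'            # whatever encoding the program wants to emit
        sys.stdout.flush()            # flush pending text before writing raw bytes
        sys.stdout.buffer.write('text \xe4\n'.encode(encoding))
        sys.stdout.buffer.flush()     # flush the BufferedWriter to keep output ordering consistent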

    @abadger
    Mannequin

    abadger mannequin commented Dec 9, 2013

    Ahh... added to the nosy list and bug closed all before I got up for the day ;-)

    A few words:

    I do think that python is broken here.

    I do not think that translating everything to utf-8 if ascii is the locale's encoding is the solution.

    As I would state it, the problem is that python's boundary with the OS is not yet uniform. If you set LC_ALL=C (note, LC_ALL=C is just one of multiple ways to break things. For instance, LC_ALL=en_US.utf8 when dealing with latin-1 data will also break) then python will still *read* non-ascii data from the OS through some interfaces but it won't output it back to the OS. I.e.:

    $ mkdir unicode && cd unicode
    $ python3 -c 'open("ñ.txt".encode("latin-1"), "w").close()'
    $ LC_ALL=en_US.utf8 python3
    >>> import os
    >>> dir_listing = os.listdir('.')
    >>> for entry in dir_listing: print(entry)
    ... 
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf1' in position 0: surrogates not allowed

    Note that currently, input() and sys.stdin.read() won't read undecodable data so this is somewhat symmetrical but it seems to me that saying "everything that interfaces with the OS except the standard streams will use surrogateescape on undecodable bytes" is drawing a line in an unintuitive location.

    (A further note to serhiy.storchaka.... Your examples are not showing anything broken in other programs. xterm is refusing both input and output that is non-ascii. This is symmetric behaviour. ls is doing its best to display a *human-readable* representation of bytes that it cannot convert in the current encoding. It also provides the -b switch to see the octal values if you actually care. Think of this like opening a binary file in less or another pager.)

    (Further note for haypo -- On Fedora, the default of en_US is utf8, not ISO8859-1.)

    @ncoghlan
    Contributor

    ncoghlan commented Dec 9, 2013

    There's a wrong assumption here: glib applications on Linux use UTF-8
    regardless of locale. That's the part I have a problem with: the assumption
    that the locale will correctly specify the encoding to use for OS APIs on
    modern Linux systems.

    It's simply not always true: some Linux distros would be better handled
    like OS X, where we always use UTF-8, regardless of what the locale says.

    @loewis
    Mannequin

    loewis mannequin commented Dec 9, 2013

    Nick: which glib functions are you specifically referring to? Many of them don't deal with strings at all, and of those that do, many are encoding-agnostic (i.e. it is correct to claim that they operate on UTF-8, but likewise also correct that they operate on Latin-1, simultaneously).

    @pitrou
    Member

    pitrou commented Dec 9, 2013

    It's simply not always true: some Linux distros would be better handled
    like OS X, where we always use UTF-8, regardless of what the locale says.

    Perhaps by the 3.5 timeframe we can default to utf-8 on all Unix
    systems?

    @ncoghlan
    Contributor

    ncoghlan commented Dec 9, 2013

    I confess I didn't independently verify the glib claim in the Stack
    Overflow post.

    However, Toshio's post covers the specific error case we were discussing at
    Flock (and I had misremembered), where the standard streams are classed as
    "OS APIs" for the purpose of deciding which encoding to use, but as user
    data APIs for the purpose of deciding which error handler to use. So the
    standard streams are only "sort of" an OS API, since they don't participate
    in the surrogateescape based round tripping guarantee by default.

    @loewis
    Mannequin

    loewis mannequin commented Dec 10, 2013

    From what I read, it appears that the SO posting is plain wrong. Consider, for example,

    https://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html#g-filename-to-utf8

    # Converts a string which is in the encoding used by GLib for filenames
    # into a UTF-8 string. Note that on Windows GLib uses UTF-8 for filenames;
    # on other platforms, this function indirectly depends on the current locale.

    The SO author might have misread the part where it says that glib uses UTF-8 *on Windows* (instead of the braindead "ANSI" encoding indirection).

    @vstinner
    Member

    2013/12/10 Martin v. Löwis <report@bugs.python.org>:

    From what I read, it appears that the SO posting is plain wrong. Consider, for example,

    https://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html#g-filename-to-utf8

    Converts a string which is in the encoding used by GLib for filenames into a UTF-8 string. Note that on Windows GLib uses UTF-8 for filenames; on other platforms, this function indirectly depends on the current locale.

    The SO author might have misread the part where it says that glib uses UTF-8 *on Windows* (instead of the braindead "ANSI" encoding indirection).

    I wrote some notes about glib here:
    http://unicodebook.readthedocs.org/en/latest/libraries.html#the-glib-library

    g_filename_from_utf8() uses the g_get_filename_charsets() encoding.
    g_get_filename_charsets() is the ANSI code page on Windows and the
    locale encoding on Linux, except if G_FILENAME_ENCODING or
    G_BROKEN_FILENAMES environment variables are set.

    glib has a nice g_filename_display_name() function.

    @abadger
    Mannequin

    abadger mannequin commented Dec 10, 2013

    Looking at the glib code, this looks like the SO post is closer to the truth. The API documentation for g_filename_to_utf8() is over-simplified to the point of confusion. This section of the glib API document is closer to what the code is doing: https://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html#file-name-encodings

    • When encoding matters, glib and gtk functions will assume that char*'s that you pass to them point to strings which are encoded in utf-8.
    • When char* are not utf8 you are responsible for converting them to utf8 to be used by the glib functions (if encoding matters).
    • glib provides g_filename_to_utf8() for the special case of transforming filenames into the encoding that glib expects. (Presumably because glib and gtk deal with non-utf8 unicode filenames more often than the equivalent environment variables, command line switches, etc).
    • Contrary to the API docs for g_filename_to_utf8(), g_filename_to_utf8() will simply return a copy of the byte string it was passed unless G_FILENAME_ENCODING or G_BROKEN_FILENAMES is set. If those are set, then the value of G_FILENAME_ENCODING might be used to attempt to decode the filename or the encoding specified in the user's locale might be used.

    @Haypo, I'm pretty sure from reading the code for g_get_filename_charsets() that you have the conditionals reversed. What I'm seeing is:

    if G_FILENAME_ENCODING:
        charset = the first charset listed in G_FILENAME_ENCODING
        if charset == '@locale':
            charset = charset of user's locale
    elif G_BROKEN_FILENAMES:
        charset = charset of user's locale
    else:
        charset = 'UTF-8'

    @vstinner
    Member

    2013/12/10 Toshio Kuratomi <report@bugs.python.org>:

    if G_FILENAME_ENCODING:
        charset = the first charset listed in G_FILENAME_ENCODING
        if charset == '@locale':
            charset = charset of user's locale
    elif G_BROKEN_FILENAMES:
        charset = charset of user's locale
    else:
        charset = 'UTF-8'

    g_get_filename_charsets() returns a list of encodings. For the last
    case (else:), it uses ['utf-8', locale_encoding] on UNIX. It's reliable
    because the utf-8 encoding has a nice feature: the utf-8 decoder fails
    if the byte string is not a valid utf-8 string.

    It would be interesting to test this approach (try utf-8, or fall back to
    the locale encoding) in
    PyUnicode_DecodeFSDefault/PyUnicode_EncodeFSDefault and
    _Py_char2wchar/_Py_wchar2char.

    @vstinner
    Member

    It would be interesting to test this approach (try utf-8 or use the locale encoding) ...

    Oh, it may be easy to implement it for decoders, but what about
    encoders? Should os.fsencode() always use UTF-8??

    @abadger
    Mannequin

    abadger mannequin commented Dec 10, 2013

    Yes, it returns a list but unless I'm missing something in the general case it's the caller's responsibility to loop through the charsets to test for failure and try again. This is not done automatically.

    In the specific case we're talking about, first get_filename_charset() decides to only return the first entry in the list of charsets: https://git.gnome.org/browse/glib/tree/glib/gconvert.c#n1118

    and then g_filename_to_utf8() disregards the charsets altogether because it sees that the filename is supposed to be utf-8 https://git.gnome.org/browse/glib/tree/glib/gconvert.c#n1160

    @Sworddragon
    Mannequin Author

    Sworddragon mannequin commented Dec 13, 2013

    > The fact that write() uses sys.getfilesystemencoding() is either
    > a defect or a bad design (I leave the decision to you).

    I have good news for you. write() does not call sys.getfilesystemencoding(), because the encoding is set at the time the file is opened.

    Now after some research I see I wasn't wrong at all. I should have said:

    "The fact that write() -> open() relies on sys.getfilesystemencoding() (respectively locale.getpreferredencoding()) at default as encoding is either a defect or a bad design (I leave the decision to you)."

    Or am I overlooking something?

    @larryhastings
    Contributor

    "The fact that write() -> open() relies on sys.getfilesystemencoding()
    (respectively locale.getpreferredencoding()) at default as encoding is
    either a defect or a bad design (I leave the decision to you)."

    Or am I overlooking something?

    First, you should probably just drop mentioning write() or print() or any of the functions that actually perform I/O. The crucial decisions about decoding are made inside open().

    Second, open() is implemented in C. It cannot "rely on sys.getfilesystemencoding()" as it never calls it. Internally, sys.getfilesystemencoding() simply returns a C global called Py_FileSystemDefaultEncoding. But open() doesn't examine that, either.

    Instead, open() determines the default encoding by calling the same function that's used to initialize Py_FileSystemDefaultEncoding: get_locale_encoding() in Python/pythonrun.c. Which on POSIX systems calls the POSIX function nl_langinfo().

    If you want to see the actual mechanisms involved, you should read the C source code in Modules/_io in the Python trunk. open() is implemented as the C function io_open() in _iomodule.c. When it opens a file in text mode without an explicit encoding, it wraps it in a TextIOWrapper object; the __init__ function for this class is the C function textiowrapper_init() in textio.c.
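
    (The default can also be observed from Python without reading the C code; a quick check, where the file path is only an example:)

        import locale

        # open() without an explicit encoding picks up the locale-derived default.
        with open('/etc/hostname') as f:   # any readable text file will do
            print('default open() encoding:', f.encoding)
        print('locale.getpreferredencoding(False):', locale.getpreferredencoding(False))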

    As for your assertion that this is "either a defect or a bad design": I leave the critique of that to others.

    @Sworddragon
    Mannequin Author

    Sworddragon mannequin commented Dec 13, 2013

    Instead, open() determines the default encoding by calling the same function that's used to initialize Py_FileSystemDefaultEncoding: get_locale_encoding() in Python/pythonrun.c. Which on POSIX systems calls the POSIX function nl_langinfo().

    open() will by default use the encoding from nl_langinfo(), as sys.getfilesystemencoding() does on *nix. This is the part that looks dirty to me. As soon as LANG is set to C, open() will fall back to 'ascii' due to nl_langinfo(), just like sys.getfilesystemencoding() does.

    @ncoghlan
    Contributor

    There's an alternative to trying to force a different encoding for the
    standard streams when the OS claims ASCII as the OS encoding: we can
    default to surrogateescape as the error handler, on the assumption that
    whatever the *real* OS encoding is, it definitely isn't ASCII.

    That means we'll still complain about displaying improperly encoded data
    when the OS suggests a plausible encoding, but we won't fail entirely just
    because someone enabled (deliberately or accidentally) the POSIX locale.

    @Sworddragon
    Mannequin Author

    Sworddragon mannequin commented Dec 13, 2013

    By the way I have found a valid use case for LANG=C. udev and Upstart are not setting LANG which will result in the ascii encoding for invoked Python scripts. This could be a problem since these applications are commonly dealing with non-ascii filesystems.

    @vstinner
    Member

    By the way, Java behaves like Python: with LANG=C, Java uses ASCII:

    http://stackoverflow.com/questions/13415975/cant-read-utf-8-filenames-when-launched-as-an-upstart-service

    udev and Upstart are not setting LANG

    So it's an issue in udev and Upstart. See for example:
    https://bugs.launchpad.net/ubuntu/+source/upstart/+bug/1235483
    https://bugs.launchpad.net/ubuntu-translations/+bug/1208272

    I found examples using "LANG=$LANG ..." when running a command in Upstart for example. I found another example using:

    if [ -r /etc/default/locale ]; then
        . /etc/default/locale
        export LANG LANGUAGE
    elif [ -r /etc/environment ]; then
        . /etc/environment
        export LANG LANGUAGE
    fi
    

    @Sworddragon
    Mannequin Author

    Sworddragon mannequin commented Dec 13, 2013

    https://bugs.launchpad.net/ubuntu/+source/upstart/+bug/1235483

    After opening many hundreds of tickets, I would say: with luck this ticket will get a response within the next year. But in the worst case it will simply be refused.

    I found examples using "LANG=$LANG

    This Upstart script:

    "exec echo LANG=$LANG > /tmp/test.txt"

    Will result in the following:

    root@ubuntu:# start test
    test stop/waiting
    root@ubuntu:# cat /tmp/test.txt
    LANG=

    At least in this example I'm getting an empty LANG on my system.

    @abadger
    Mannequin

    abadger mannequin commented Dec 13, 2013

    It's not a bug for upstart, systemd, sysvinit, cron, etc to use LANG=C. The POSIX locale is the only locale guaranteed to exist on a system. Therefore these low level services should be using LANG=C. Embedded systems, thin clients, and other low memory or low disk devices may benefit from shipping without any locales.

    @vstinner
    Member

    I created the issue bpo-19977 as a follow up of this one: "Use surrogateescape error handler for sys.stdout on UNIX for the C locale".

    @vstinner
    Member

    I propose to modify the error handler; the encoding cannot be modified. See the following message explaining why it's not possible to change the encoding:
    http://bugs.python.org/issue19846#msg205675

    @ncoghlan
    Contributor

    Thanks Victor - I now agree that trying to guess another encoding is a bad idea, and that enabling surrogateescape for the standard streams under the C locale is a better way to go.

    @terryjreedy
    Member

    Since Victor's alternative in bpo-19977 has been applied, should this issue be closed?

    @ncoghlan
    Contributor

    Also see http://bugs.python.org/issue28180 for a more recent proposal to tackle this by coercing the C locale to the C.UTF-8 locale

    @vstinner
    Member

    Follow-up: PEP 538 (bpo-28180) and PEP 540 (bpo-29240) have been accepted and implemented in Python 3.7!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022