
Author vstinner
Recipients methane, ncoghlan, steve.dower, vstinner
Date 2019-03-06.16:27:44
Message-id <1551889664.73.0.217164410962.issue36204@roundup.psfhosted.org>
In-reply-to
Content
> RE making UnixMain public, I'd rather the core runtime require a known encoding, rather than trying to detect it. We should move the call into the detection logic into Programs/python.c so that embedders have to opt-in to detection (many embedding scenarios will prefer to do their own encoding).

Unix is a very complex beast and Python makes it worse by adding more options (PEP 538 and PEP 540). Py_UnixMain() works "as expected": it uses the LC_CTYPE locale encoding.

If you want to force UTF-8, you can opt in to UTF-8 mode: for example, call putenv("PYTHONUTF8=1") before Py_UnixMain().
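
A minimal sketch of that opt-in, assuming a POSIX system. request_utf8_mode() is a hypothetical helper name, not part of the Python C API, and it uses setenv() rather than putenv() only because setenv() copies its arguments. The call to Python's main function is left as a comment, since the exact entry point depends on the Python version being embedded.

```c
#include <stdlib.h>

/* Hypothetical helper (not a CPython function): ask for UTF-8 mode by
   setting PYTHONUTF8=1 in the process environment. This must run before
   Python is initialized, because the variable is read during startup.
   Returns 0 on success and -1 on failure, like setenv(). */
int
request_utf8_mode(void)
{
    /* The third argument (1) overwrites any existing value. */
    return setenv("PYTHONUTF8", "1", 1);
}

/* Intended call order in an embedding program (sketch only):
 *
 *     request_utf8_mode();
 *     ... then call Python's main function, e.g. Py_UnixMain() ...
 */
```

Without this call, the locale encoding selected by LC_CTYPE is used, as described above.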

You cannot pass an encoding to Py_UnixMain() because the implementation of Python relies heavily on the LC_CTYPE locale: see the Py_DecodeLocale() and Py_EncodeLocale() functions. In any case, Python must use the locale encoding to avoid mojibake. To be able to load its own codecs, Python must first use the codec from the C library: mbstowcs() and wcstombs(). Python has a few codecs implemented in C, such as ASCII, UTF-8 and Latin1, but locales are far more diverse than that. For example, ISO-8859-15 is used for "euro" locale variants. Example:

$ LANG=fr_FR.iso885915@euro python3 -c 'import sys; print(sys.getfilesystemencoding())'
iso8859-15
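
To illustrate what "the codec from the C library" means, here is a rough sketch of decoding bytes with mbstowcs() under the current LC_CTYPE locale. decode_with_locale() is a hypothetical helper written for this message; it is not the actual Py_DecodeLocale() implementation, which also handles undecodable bytes and other corner cases. It assumes the POSIX behavior where mbstowcs() with a NULL destination returns the required length.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper (not a CPython function): decode a NUL-terminated
   byte string using the codec of the current LC_CTYPE locale.
   Returns a malloc()ed wide string that the caller must free(), or NULL
   if the bytes are invalid in this locale or memory runs out. */
wchar_t *
decode_with_locale(const char *s)
{
    assert(s != NULL);

    /* POSIX: with a NULL destination, mbstowcs() only counts the wide
       characters needed, excluding the terminating null. */
    size_t len = mbstowcs(NULL, s, 0);
    if (len == (size_t)-1) {
        return NULL;  /* invalid byte sequence for this locale */
    }

    wchar_t *out = malloc((len + 1) * sizeof(wchar_t));
    if (out == NULL) {
        return NULL;
    }

    /* Second pass: convert for real, including the terminating null. */
    if (mbstowcs(out, s, len + 1) == (size_t)-1) {
        free(out);
        return NULL;
    }
    return out;
}
```

A program would normally call setlocale(LC_CTYPE, "") once at startup so that the user's locale (for example fr_FR.iso885915@euro) is the one mbstowcs() uses; without that call, the "C" locale stays in effect.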

Python has an ISO-8859-15 codec, but it's implemented in pure Python. Python uses importlib to load the codec, but how does Python decode and encode filenames to import Lib/encodings/iso8859_15.py? That's where mbstowcs()/wcstombs() and Py_DecodeLocale()/Py_EncodeLocale() come into play :-) Enjoy:

PyObject*
PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size)
{
    PyInterpreterState *interp = _PyInterpreterState_GET_UNSAFE();
    const _PyCoreConfig *config = &interp->core_config;
#if defined(__APPLE__)
    return PyUnicode_DecodeUTF8Stateful(s, size, config->filesystem_errors, NULL);
#else
    /* Bootstrap check: if the filesystem codec is implemented in Python, we
       cannot use it to encode and decode filenames before it is loaded. Load
       the Python codec requires to encode at least its own filename. Use the C
       implementation of the locale codec until the codec registry is
       initialized and the Python codec is loaded. See initfsencoding(). */
    if (interp->fscodec_initialized) {
        return PyUnicode_Decode(s, size,
                                config->filesystem_encoding,
                                config->filesystem_errors);
    }
    else {
        return unicode_decode_locale(s, size,
                                     config->filesystem_errors, 0);
    }
#endif
}