Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python3/POSIX: errors if file system encoding is None #52856

Closed
vstinner opened this issue May 4, 2010 · 17 comments
Closed

Python3/POSIX: errors if file system encoding is None #52856

vstinner opened this issue May 4, 2010 · 17 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode

Comments

@vstinner
Copy link
Member

vstinner commented May 4, 2010

BPO 8610
Nosy @malemburg, @loewis, @pitrou, @vstinner
Files
  • initfsencoding-4.patch
  • initfsencoding-5-utf-8.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2010-05-15.12:39:45.690>
    created_at = <Date 2010-05-04.11:58:56.531>
    labels = ['interpreter-core', 'expert-unicode']
    title = 'Python3/POSIX:  errors if file system encoding is None'
    updated_at = <Date 2010-05-15.12:39:45.689>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2010-05-15.12:39:45.689>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-05-15.12:39:45.690>
    closer = 'vstinner'
    components = ['Interpreter Core', 'Unicode']
    creation = <Date 2010-05-04.11:58:56.531>
    creator = 'vstinner'
    dependencies = []
    files = ['17247', '17252']
    hgrepos = []
    issue_num = 8610
    keywords = ['patch']
    message_count = 17.0
    messages = ['104924', '104984', '104986', '105002', '105005', '105007', '105008', '105010', '105031', '105175', '105176', '105188', '105189', '105195', '105212', '105247', '105805']
    nosy_count = 5.0
    nosy_names = ['lemburg', 'loewis', 'pitrou', 'vstinner', 'Arfrever']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue8610'
    versions = ['Python 3.2']

    @vstinner
    Copy link
    Member Author

    vstinner commented May 4, 2010

    On POSIX (but not on Mac OS X), Python3 calls get_codeset() to get the file system encoding. If this function fails, sys.getfilesystemencoding() returns None. PyUnicode_DecodeFSDefaultAndSize() fallbacks to utf-8 whereas subprocess fail:

    ...
    File "/home/SHARE/SVN/py3k/Lib/subprocess.py", line 670, in __init__
    restore_signals, start_new_session)
    File "/home/SHARE/SVN/py3k/Lib/subprocess.py", line 1101, in _execute_child
    executable_list = (fs_encode(executable),)
    File "/home/SHARE/SVN/py3k/Lib/subprocess.py", line 1088, in fs_encode
    return s.encode(fs_encoding, 'surrogateescape')
    TypeError: encode() argument 1 must be string, not None

    We have two choices: raise a fatal error if get_codeset() failed, or fallback to utf-8.

    On Windows and Mac OS X, get_codeset() shouldn't be called because the result is just dropped. We should call _PyCodec_Lookup(Py_FileSystemDefaultEncoding) instead to ensure that the file system encoding can be loaded.

    @vstinner vstinner added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode labels May 4, 2010
    @vstinner
    Copy link
    Member Author

    vstinner commented May 4, 2010

    Here is a patch for the first solution: display a fatal error if we are unable to get the locale encoding. It does always exit with a fatal error if nl_langinfo(CODESET) is not available (and Py_FileSystemDefaultEncoding is not set).

    I don't think it's a good idea to display an fatal error at runtime. If nl_langinfo(CODESET) is not available, configure should fail or we should fallback to an hardcoded encoding (ok but which one?).

    Extract of the nl_langinfo() manual page (on Linux):

    CONFORMING TO
    SUSv2, POSIX.1-2001.

    @vstinner
    Copy link
    Member Author

    vstinner commented May 5, 2010

    Patch for the second solution (fallback to utf-8 on get_codeset() failure):

    • create a subfunction initfsencoding() (Py_InitializeEx is already very long)
    • hardcode the encoding to utf-8 if nl_langinfo(CODESET) is missing
    • don't call get_codeset() on Windows or Mac OS X
    • call _PyCodec_Lookup(Py_FileSystemDefaultEncoding) if get_codeset() was not called (eg. on Windows) or if get_codeset() failed to ensure that the codec can be (and is) loaded: display a fatal error on failure

    Since I wrote patches for both solution, I can now compare correctly advantages and disavantages. I prefer initfsencoding() because it works on all cases and is simpler than no_fsencoding_error.patch.

    @malemburg
    Copy link
    Member

    STINNER Victor wrote:

    I don't think it's a good idea to display an fatal error at runtime. If nl_langinfo(CODESET) is not available, configure should fail or we should fallback to an hardcoded encoding (ok but which one?).

    If nl_langinfo(CODESET) fails, Python should assume the default
    locale, which is "C" on POSIX platforms. The "C" locale uses
    ASCII as encoding, so Python should use that as well. Note that the
    manpage for nl_langinfo() doesn't mention any errors that could
    be raised by it:

    """
    RETURN VALUE
    If no locale has been selected for the appropriate category, nl_langinfo() returns a
    pointer to the corresponding string in the "C" locale.

       If item is not valid, a pointer to an empty string is returned.
    
       This pointer may point to static data that may be overwritten on the next call to  nl_lang‐
       info() or setlocale(3).
    

    """

    As with all locale APIs, it is not thread-safe, which can become
    an issues if Python gets embedded in a multi-threaded application.

    There's also another issue: it's possible that nl_langinfo(CODESET)
    returns an encoding which is not known to Python.

    In such a case, it would be best to issue a warning to the
    user and fall back to ASCII as in the "C" locale case.

    Terminating Python with a fatal error would provide the worst of
    all user experiences. -1 on that.

    @vstinner
    Copy link
    Member Author

    vstinner commented May 5, 2010

    manpage for nl_langinfo() doesn't mention any errors that could
    be raised by it

    It's more about get_codeset(). This function can fail for different reasons:

    • nl_langinfo() result is an empty string: "If item is not valid, a pointer to an empty string is returned." say the manpage
    • _PyCodec_Lookup() failed: unable to import the encoding codec module, there is no such codec, codec machinery is broken, etc.
    • the codec has no "name "attribute
    • strdup() failure (no more memory)

    Do you think that you should fallback to ASCII if nl_langinfo() result is an empty string, and UTF-8 otherwise? get_codeset() failure is very unlikely, and I think that fallback to UTF-8 is just fine. A warning is printed to stderr, the user should try to understand why get_codeset() failed.

    You can at least reproduce the _PyCodec_Lookup() error with bpo-8611.

    My problem is also that the file system encoding is required (encoding != None) by os.environ mapping with my os.environb patch. (bpo-8603)

    @malemburg
    Copy link
    Member

    STINNER Victor wrote:

    STINNER Victor <victor.stinner@haypocalc.com> added the comment:

    > manpage for nl_langinfo() doesn't mention any errors that could
    > be raised by it

    It's more about get_codeset(). This function can fail for different reasons:

    • nl_langinfo() result is an empty string: "If item is not valid, a pointer to an empty string is returned." say the manpage
    • _PyCodec_Lookup() failed: unable to import the encoding codec module, there is no such codec, codec machinery is broken, etc.
    • the codec has no "name "attribute
    • strdup() failure (no more memory)

    Do you think that you should fallback to ASCII if nl_langinfo() result is an empty string, and UTF-8 otherwise? get_codeset() failure is very unlikely, and I think that fallback to UTF-8 is just fine. A warning is printed to stderr, the user should try to understand why get_codeset() failed.

    I think that using ASCII is a safer choice in case of errors.
    Using UTF-8 may be safe for reading file names, but it's not
    safe for creating files or directories.

    I also think that an application should be able to update the
    file system encoding in such an error case (and only in such a case).
    The application may have better knowledge about how it's being
    used and can provide correct encoding information by other means.

    @vstinner
    Copy link
    Member Author

    vstinner commented May 5, 2010

    I think that using ASCII is a safer choice in case of errors.

    I choosed UTF-8 to keep backward compatibility: PyUnicode_DecodeFSDefaultAndSize() uses utf-8 if Py_FileSystemDefaultEncoding==NULL. If the OS has no nl_langinfo(CODESET) function at all, Python3 uses utf-8.

    Using UTF-8 may be safe for reading file names, but it's not
    safe for creating files or directories.

    Well, I don't know. You are maybe right. And which encoding should be used if nl_langinfo(CODESET) function is missing: ASCII or UTF-8?

    UTF-8 is also an optimist choice: I bet that more and more OS will move to UTF-8.

    @malemburg
    Copy link
    Member

    STINNER Victor wrote:

    STINNER Victor <victor.stinner@haypocalc.com> added the comment:

    > I think that using ASCII is a safer choice in case of errors.

    I choosed UTF-8 to keep backward compatibility: PyUnicode_DecodeFSDefaultAndSize() uses utf-8 if Py_FileSystemDefaultEncoding==NULL. If the OS has no nl_langinfo(CODESET) function at all, Python3 uses utf-8.

    Ouch, that was a poor choice. In Python we have a tradition to
    avoid guessing, if possible. Since we cannot guarantee that the
    file system will indeed use UTF-8, it would have been safer to
    use ASCII. Not sure why this reasoning wasn't applied for
    the file system encoding.

    Nothing we can do about now, though.

    > Using UTF-8 may be safe for reading file names, but it's not
    > safe for creating files or directories.

    Well, I don't know. You are maybe right. And which encoding should be used if nl_langinfo(CODESET) function is missing: ASCII or UTF-8?

    UTF-8 is also an optimist choice: I bet that more and more OS will move to UTF-8.

    I think we should also add a new environment variable to override
    the automatic determination of the file system encoding, much like
    what we have for the I/O encoding:

    PYTHONFSENCODING: Encoding[:errors] used for file system.

    (that would need to go on a new ticket, though)

    @malemburg
    Copy link
    Member

    I've opened bpo-8622 for the env. var idea.

    @vstinner
    Copy link
    Member Author

    vstinner commented May 6, 2010

    New patch:

    • set encoding to ASCII on nl_langinfo(CODESET) failure
    • set encoding to ASCII if nl_langinfo(CODESET) is missing
    • document the changes

    NEWS entry: "Issue bpo-8610: Load file system codec at startup, and display a fatal error on failure. Set the file system encoding to ascii if getting the locale encoding failed, or if nl_langinfo(CODESET) function is missing."

    @vstinner
    Copy link
    Member Author

    vstinner commented May 6, 2010

    "I think that using ASCII is a safer choice in case of errors. (...) Ouch, that was a poor choice."

    Ok, you conviced me with your PYTHONFSENCODING suggestion (bpo-8622). Can you review my last patch please?

    @malemburg
    Copy link
    Member

    STINNER Victor wrote:

    STINNER Victor <victor.stinner@haypocalc.com> added the comment:

    "I think that using ASCII is a safer choice in case of errors. (...) Ouch, that was a poor choice."

    Ok, you conviced me with your PYTHONFSENCODING suggestion (bpo-8622). Can you review my last patch please?

    I don't think we can change the fallback encoding in 3.2. But you
    can start a discussion on python-dev, of course.

    The number of Python 3.x users is still small, so perhaps it's still
    possible to revert the choice and to use a safer default, which then
    results in encoding errors that the user can see in form of tracebacks
    rather than just by having actively checking the directory listing
    for strange symbols.

    Some comments on the patch:

    + fprintf(stderr,
    + "Unable to get the locale encoding: "
    + "fallback to utf-8\n");

    This would have to read "... to ASCII"

    + Py_FileSystemDefaultEncoding = "ascii";

    + codec = _PyCodec_Lookup(Py_FileSystemDefaultEncoding);
    + if (!codec) {
    + Py_FatalError(
    + "Py_Initialize: unable to load the file system codec");

    It's better to use the same approach as above for this situation
    as well.

    Fatal errors are just not user friendly and will likely cause
    bad vibes towards Python3.

    E.g.

    	fprintf(stderr,
    		"Unknown locale encoding: "
    		"fallback to ASCII\\n");
    

    You also need to change this line in pythonrun.c:

            /* reset file system default encoding */
            if (!Py_HasFileSystemDefaultEncoding) {
                    free((char*)Py_FileSystemDefaultEncoding);
                    Py_FileSystemDefaultEncoding = NULL;
            }

    I'm not sure what the purpose of Py_HasFileSystemDefaultEncoding
    is. If we define a default for the file system encoding,
    then why do we bother whether there is one ?

    In any case, initfsencoding() would always have to set that
    flag to 1.

    @vstinner
    Copy link
    Member Author

    vstinner commented May 7, 2010

    Le vendredi 07 mai 2010 11:19:52, vous avez écrit :

    > Ok, you conviced me with your PYTHONFSENCODING suggestion (bpo-8622). Can
    > you review my last patch please?

    I don't think we can change the fallback encoding in 3.2. But you
    can start a discussion on python-dev, of course.

    Ok, I will ask on python-dev.

    • fprintf(stderr,
      
    • 	"Unable to get the locale encoding: "
      
    • 	"fallback to utf-8\\n");
      

    This would have to read "... to ASCII"

    Fixed.

    • Py_FileSystemDefaultEncoding = "ascii";
      
    • codec = _PyCodec_Lookup(Py_FileSystemDefaultEncoding);

    • if (!codec) {

    • Py_FatalError(
      
    •     "Py_Initialize: unable to load the file system codec");
      

    It's better to use the same approach as above for this situation
    as well.

    I choosed to display a fatal error here to give a more revelant error message
    to the user. Without the check, _PyCodec_Lookup() will fail anyway, but later
    in a random function :-/

    The fatal error only occurs in critical situations: no more memory, import
    machinery completly broken (eg. bpo-8611), etc. In this case, fallback to ASCII
    doesn't help, it will also raise somewhere later.

    About nl_langinfo(CODESET): get_codeset() does already reject unknown
    encoding. So this call is only done on known encoding names.

    You also need to change this line in pythonrun.c:

        /* reset file system default encoding \*/
        if (!Py_HasFileSystemDefaultEncoding) {
                free((char*)Py_FileSystemDefaultEncoding);
                Py_FileSystemDefaultEncoding = NULL;
        }
    

    Fixed. This test only match if get_codeset() is used: I choosed to set the
    encoding to ASCII with Py_HasFileSystemDefaultEncoding=0.

    I'm not sure what the purpose of Py_HasFileSystemDefaultEncoding
    is.

    Its name doesn't help. It's just a flag to tell if free() should be called or
    not... (see _Py_SetFileSystemEncoding()).

    In any case, initfsencoding() would always have to set that
    flag to 1.

    initfsencoding() is a static function and it's only called by
    Py_InitializeEx(). Can Py_InitializeEx() be called multiple times?

    @malemburg
    Copy link
    Member

    STINNER Victor wrote:

    > + codec = _PyCodec_Lookup(Py_FileSystemDefaultEncoding);
    > + if (!codec) {
    > + Py_FatalError(
    > + "Py_Initialize: unable to load the file system codec");
    >
    > It's better to use the same approach as above for this situation
    > as well.

    I choosed to display a fatal error here to give a more revelant error message
    to the user. Without the check, _PyCodec_Lookup() will fail anyway, but later
    in a random function :-/

    The fatal error only occurs in critical situations: no more memory, import
    machinery completly broken (eg. bpo-8611), etc. In this case, fallback to ASCII
    doesn't help, it will also raise somewhere later.

    About nl_langinfo(CODESET): get_codeset() does already reject unknown
    encoding. So this call is only done on known encoding names.

    Ok, please add a comment to that part explaining why it can only
    fail as result of some other serious error and not because
    the Py_FileSystemDefaultEncoding was not found to be supported
    by Python.

    > I'm not sure what the purpose of Py_HasFileSystemDefaultEncoding
    > is.

    Its name doesn't help. It's just a flag to tell if free() should be called or
    not... (see _Py_SetFileSystemEncoding()).

    Interesting... I would associate a completely different meaning
    with it.

    Scratch my comment on that flag then.

    > In any case, initfsencoding() would always have to set that
    > flag to 1.

    initfsencoding() is a static function and it's only called by
    Py_InitializeEx(). Can Py_InitializeEx() be called multiple times?

    Well, it shouldn't be called multiple times, but then you never
    know how embedded Python interpreters are used.

    @vstinner
    Copy link
    Member Author

    vstinner commented May 7, 2010

    Version 4: I forgot #include <langinfo.h> in bltinmodule.c.

    @vstinner
    Copy link
    Member Author

    vstinner commented May 7, 2010

    I realized that fallback to ASCII instead of UTF-8 is not possible yet because of bpo-8611: if it fallbacks to ASCII, it's not more possible to run Python in a non-ASCII directory. I have a patch set fixing bpo-8611 but it's huge and complex. I will not be fixed quickly (if it would be possible someday to fix it).

    My new patch fallback to utf-8 instead of ascii, even if I agree that it would be better to fallback to ascii. Improve unicode, surrogates & friends is complex, and I prefer to fix bugs step by step. I propose to first ensure that Py_FileSystemEncoding is always set, and later write a new patch to fallback to ASCII instead of UTF-8.

    Patch version 5:

    • fallback to utf-8 instead of ascii
    • Set Py_FileSystemDefaultEncoding to NULL to Py_Finalize(): thanks to that, it should be possible to call Py_InitializeEx() (initfsencoding()) twice or more
    • initfsencoding() doesn't call _PyCodec_Lookup() on get_codeset() success because get_codeset() does already call it
    • explain that the fatal error is very unlikely

    @vstinner
    Copy link
    Member Author

    I commited the last patch (fall back to UTF-8): r81190 (3.x), blocked in 3.1 (r81191).

    I opened a new issue for the UTF-8/ASCII fallback: bpo-8725, because the ASCII fallback is a different issue.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants