Add PYTHONFSENCODING environment variable #52868

Closed
malemburg opened this issue May 5, 2010 · 19 comments
Labels: interpreter-core (Objects, Python, Grammar, and Parser dirs); type-bug (An unexpected behavior, bug, or error)

Comments

@malemburg
Member

BPO 8622
Nosy @malemburg, @pitrou, @vstinner, @bitdancer, @florentx
Files
  • pythonfsencoding-2.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/vstinner'
    closed_at = <Date 2010-08-25.08:36:47.611>
    created_at = <Date 2010-05-05.13:55:23.780>
    labels = ['interpreter-core', 'type-bug']
    title = 'Add PYTHONFSENCODING environment variable'
    updated_at = <Date 2010-08-25.08:36:47.609>
    user = 'https://github.com/malemburg'

    bugs.python.org fields:

    activity = <Date 2010-08-25.08:36:47.609>
    actor = 'vstinner'
    assignee = 'vstinner'
    closed = True
    closed_date = <Date 2010-08-25.08:36:47.611>
    closer = 'vstinner'
    components = ['Interpreter Core']
    creation = <Date 2010-05-05.13:55:23.780>
    creator = 'lemburg'
    dependencies = []
    files = ['18564']
    hgrepos = []
    issue_num = 8622
    keywords = ['patch', 'buildbot']
    message_count = 19.0
    messages = ['105030', '114194', '114208', '114210', '114212', '114265', '114278', '114344', '114353', '114358', '114408', '114421', '114699', '114702', '114720', '114721', '114852', '114874', '114887']
    nosy_count = 6.0
    nosy_names = ['lemburg', 'pitrou', 'vstinner', 'Arfrever', 'r.david.murray', 'flox']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue8622'
    versions = ['Python 3.2']

    @malemburg
    Member Author

    As discussed on bpo-8610, we need a way to override the automatic detection of the file system encoding - for much the same reasons we also do for the I/O encoding: the detection mechanism isn't fail-safe.

    We should add a new environment variable with the same functionality as PYTHONIOENCODING:

    PYTHONFSENCODING: Encoding[:errors] used for file system.
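
A minimal sketch of how a value in the proposed `Encoding[:errors]` form could be parsed and validated. The helper name `parse_encoding_spec` and the default error handler are assumptions for illustration; this is not the code from the patch, and unlike PYTHONIOENCODING it does not accept an empty encoding part.

```python
import codecs


def parse_encoding_spec(value, default_errors="strict"):
    """Split an 'encoding[:errors]' string into (encoding, errors).

    Hypothetical helper: the name and the default error handler are
    illustrative assumptions, not taken from the patch.
    """
    encoding, sep, errors = value.partition(":")
    if not encoding:
        raise ValueError("encoding name is missing")
    # codecs.lookup() raises LookupError for an unknown encoding and
    # returns a CodecInfo whose .name is the normalized codec name.
    name = codecs.lookup(encoding).name
    return name, (errors if sep and errors else default_errors)


print(parse_encoding_spec("utf-8"))
print(parse_encoding_spec("ascii:replace"))
```

An unknown name such as `xxx` raises `LookupError`, which is the failure case discussed later in this thread.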
    

    @malemburg malemburg added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label May 5, 2010
    @vstinner
    Member

    Here you have a patch. It adds tests in test_sys.

    The tests are skipped on a non-ascii Python executable path because of bpo-8611 (see bpo-9425).

    @malemburg
    Member Author

    STINNER Victor wrote:

    STINNER Victor <victor.stinner@haypocalc.com> added the comment:

    Here you have a patch. It adds tests in test_sys.

    The tests are skipped on a non-ascii Python executable path because of bpo-8611 (see bpo-9425).

    Thanks for the patch.

    A couple of notes:

    • The command line -h explanation is missing from the patch.

    • The documentation should mention that the env var is only
      read once; subsequent changes to the env var are not seen
      by Python

    • If the codec lookup fails, Python should issue a warning
      and then ignore the env var (using the get_codeset() API instead).

    • Unrelated to the env var, but still important: if get_codeset()
      does not return a known codec, Python should issue a warning
      before falling back to the default setting. Otherwise, a
      Python user will never know that there's an issue and this
      makes debugging a lot harder.
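
To illustrate the "read once" point: the filesystem encoding is fixed during interpreter startup, so changing the environment of a running process has no effect on it. The sketch below only demonstrates that behavior. Whether a given Python version recognizes PYTHONFSENCODING at all is not assumed here.

```python
import os
import sys

before = sys.getfilesystemencoding()

# Modify the environment of the already-running process. Startup code is
# not re-run, so the interpreter does not pick this value up.
saved = os.environ.get("PYTHONFSENCODING")
os.environ["PYTHONFSENCODING"] = "latin-1"
after = sys.getfilesystemencoding()

# Restore the previous state of the environment.
if saved is None:
    os.environ.pop("PYTHONFSENCODING", None)
else:
    os.environ["PYTHONFSENCODING"] = saved

print(before, after, before == after)
```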

    We should also add a new sys.setfilesystemencoding()
    function to make changes possible after Python startup. This
    would have to go on a separate ticket, though. Or is there
    some concept preventing this?

    @vstinner
    Member

    The command line -h explanation is missing from the patch.

    done

    The documentation should mention that the env var is only
    read once; subsequent changes to the env var are not seen
    by Python

    I copied the PYTHONIOENCODING doc which doesn't mention that. Does Python re-read other environment variables at runtime? Anyway, I changed the doc to:

    + If this is set before running the interpreter, it overrides the encoding used
    + for the filesystem encoding (see :func:`sys.getfilesystemencoding`).

    I also changed PYTHONIOENCODING doc. Is it better?

    If the codec lookup fails, Python should either issue a warning

    Ok, done. I also patched get_codeset() and get_codec_name() to always set a Python error.

    ... and then ignore the env var (using the get_codeset() API).

    Good idea, done.

    Unrelated to the env var, but still important: if get_codeset()
    does not return a known codec, Python should issue a warning
    before falling back to the default setting. Otherwise, a
    Python user will never know that there's an issue and this
    makes debugging a lot harder.

    It does already write a message to stderr, but it doesn't explain why it failed.

    I changed initfsencoding() to display two messages on a get_codeset() error: the first explains why get_codeset() failed (with the Python error), and the second says that we fall back to utf-8.

    Full example (PYTHONFSENCODING error and simulated get_codeset() error):
    ---
    PYTHONFSENCODING is not a valid encoding:
    LookupError: unknown encoding: xxx
    Unable to get the locale encoding:
    ValueError: CODESET is not set or empty
    Unable to get the filesystem encoding: fallback to utf-8
    ---
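
The fallback order described above (environment variable, then locale encoding, then utf-8) can be sketched in Python. This is an illustrative model only: the real logic lives in C in initfsencoding(), the function name `choose_fs_encoding` is hypothetical, and the sketch reports every failure as a `LookupError` rather than reproducing the exact exceptions shown in the example output.

```python
import codecs
import locale
import os
import sys


def choose_fs_encoding(env_value, locale_encoding, log=sys.stderr):
    """Return the first usable encoding among env var, locale and utf-8.

    Hypothetical model of the fallback chain; not the C implementation.
    """
    candidates = (
        ("PYTHONFSENCODING is not a valid encoding", env_value),
        ("Unable to get the locale encoding", locale_encoding),
    )
    for message, name in candidates:
        if name is None:
            continue  # this source was not provided; try the next one
        try:
            return codecs.lookup(name).name
        except LookupError as exc:
            # Explain why this candidate was rejected before moving on.
            print(f"{message}:\n{type(exc).__name__}: {exc}", file=log)
    print("Unable to get the filesystem encoding: fallback to utf-8", file=log)
    return "utf-8"


print(choose_fs_encoding(os.environ.get("PYTHONFSENCODING"),
                         locale.getpreferredencoding(False)))
```

Each rejected candidate produces its own message, so a user can see which step failed instead of only seeing the final fallback.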

    We should also add a new sys.setfilesystemencoding() ...

    No, I plan to REMOVE this function. sys.setfilesystemencoding() is dangerous because it introduces a lot of inconsistencies: it cannot re-encode all filenames in all objects (e.g. Python cannot find the filenames held by user objects or third-party libraries). For example, if you change the filesystem encoding from utf8 to ascii, existing non-ascii (unicode) filenames can no longer be used: they raise UnicodeEncodeError. Like sys.setdefaultencoding() in Python 2, I think that sys.setfilesystemencoding() is the root of evil :-)
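
A small illustration of the utf8-to-ascii problem mentioned above, using plain `str.encode()` rather than any filesystem API. The filename and the helper `can_encode` are made up for the example.

```python
def can_encode(filename, encoding):
    """Return True if filename can be encoded strictly with encoding."""
    try:
        filename.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


name = "caf\u00e9.txt"  # hypothetical non-ASCII filename

print(can_encode(name, "utf-8"))  # usable while the encoding is utf-8
print(can_encode(name, "ascii"))  # no longer usable after a switch to ascii
```

A string that was a valid filename under the old encoding has no byte representation under the new one, which is why changing the encoding after startup breaks names that already exist in memory.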

    At startup, initfsencoding() sets the filesystem encoding using the locale encoding. Even for the startup process (with very few objects), it's very hard to find all filenames:

    • sys.path
    • sys.meta_path
    • sys.modules
    • sys.executable
    • all code objects
    • and I'm not sure that the list is complete

    See bpo-9630 for the details.
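
As a rough sketch of why tracking down every filename is hard, the following walks only the places listed above that are easy to reach from Python code. The function name `startup_filenames` is hypothetical, and the result is deliberately incomplete: it ignores sys.meta_path, nested code objects and anything held by user objects.

```python
import sys
import types


def startup_filenames():
    """Collect filenames reachable from a few well-known interpreter objects.

    Illustrative only: this covers sys.path, sys.executable, module
    __file__ attributes and module-level function code objects.
    """
    names = {entry for entry in sys.path if isinstance(entry, str)}
    if sys.executable:
        names.add(sys.executable)
    for module in list(sys.modules.values()):
        if not isinstance(module, types.ModuleType):
            continue
        filename = getattr(module, "__file__", None)
        if isinstance(filename, str):
            names.add(filename)
        # Plain functions defined at module level carry a code object
        # that stores the source filename in co_filename.
        for obj in list(vars(module).values()):
            if isinstance(obj, types.FunctionType):
                names.add(obj.__code__.co_filename)
    return names


print(len(startup_filenames()))
```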

    To remove sys.setfilesystemencoding(), I already patched the PEP-383 tests (r84170) and I will open a new issue. But it may be better to commit both changes (removing the function and adding PYTHONFSENCODING) at the same time.

    @vstinner
    Member

    To remove sys.setfilesystemencoding(), ... I will open a new issue

    done, issue bpo-9632

    @malemburg
    Member Author

    STINNER Victor wrote:

    STINNER Victor <victor.stinner@haypocalc.com> added the comment:

    > The command line -h explanation is missing from the patch.

    done

    > The documentation should mention that the env var is only
    > read once; subsequent changes to the env var are not seen
    > by Python

    I copied the PYTHONIOENCODING doc which doesn't mention that. Does Python re-read other environment variables at runtime? Anyway, I changed the doc to:

    + If this is set before running the interpreter, it overrides the encoding used
    + for the filesystem encoding (see :func:`sys.getfilesystemencoding`).

    I also changed PYTHONIOENCODING doc. Is it better?

    Yes, thanks.

    > If the codec lookup fails, Python should either issue a warning

    Ok, done. I also patched get_codeset() and get_codec_name() to always set a Python error.

    > ... and then ignore the env var (using the get_codeset() API).

    Good idea, done.

    > Unrelated to the env var, but still important: if get_codeset()
    > does not return a known codec, Python should issue a warning
    > before falling back to the default setting. Otherwise, a
    > Python user will never know that there's an issue and this
    > makes debugging a lot harder.

    It does already write a message to stderr, but it doesn't explain why it failed.

    I changed initfsencoding() to display two messages on a get_codeset() error: the first explains why get_codeset() failed (with the Python error), and the second says that we fall back to utf-8.

    Full example (PYTHONFSENCODING error and simulated get_codeset() error):
    ---
    PYTHONFSENCODING is not a valid encoding:
    LookupError: unknown encoding: xxx
    Unable to get the locale encoding:
    ValueError: CODESET is not set or empty
    Unable to get the filesystem encoding: fallback to utf-8
    ---

    Looks good!

    > We should also add a new sys.setfilesystemencoding() ...

    No, I plan to REMOVE this function. sys.setfilesystemencoding() is dangerous because it introduces a lot of inconsistencies: it cannot re-encode all filenames in all objects (e.g. Python cannot find the filenames held by user objects or third-party libraries). For example, if you change the filesystem encoding from utf8 to ascii, existing non-ascii (unicode) filenames can no longer be used: they raise UnicodeEncodeError. Like sys.setdefaultencoding() in Python 2, I think that sys.setfilesystemencoding() is the root of evil :-)

    Sorry, I wasn't aware we had such a function (and was looking at the
    wrong file so didn't find it).

    At startup, initfsencoding() sets the filesystem encoding using the locale encoding. Even for the startup process (with very few objects), it's very hard to find all filenames:

    • sys.path
    • sys.meta_path
    • sys.modules
    • sys.executable
    • all code objects
    • and I'm not sure that the list is complete

    See bpo-9630 for the details.

    To remove sys.setfilesystemencoding(), I already patched the PEP-383 tests (r84170) and I will open a new issue. But it may be better to commit both changes (removing the function and adding PYTHONFSENCODING) at the same time.

    @vstinner
    Member

    Committed to 3.2 as r84182.

    @vstinner
    Member

    Oh, I realized that PYTHONFSENCODING is ignored on Windows and Mac OS X. r84201 and r84202 fix test_sys, and r84203 fixes the documentation and Python usage (hide PYTHONFSENCODING variable in Python help on Windows and Mac OS X).

    We could allow overriding the filesystem encoding on Windows, but I don't think it is a good idea because third-party libraries will use the mbcs encoding anyway.

    @malemburg
    Member Author

    STINNER Victor wrote:

    STINNER Victor <victor.stinner@haypocalc.com> added the comment:

    Oh, I realized that PYTHONFSENCODING is ignored on Windows and Mac OS X. r84201 and r84202 fix test_sys, and r84203 fixes the documentation and Python usage (hide PYTHONFSENCODING variable in Python help on Windows and Mac OS X).

    This has to be changed: The env var needs to be respected on all
    platforms.

    @vstinner
    Member

    > Oh, I realized that PYTHONFSENCODING is ignored on Windows and Mac OS X.
    > r84201 and r84202 fix test_sys, and r84203 fixes the documentation and
    > Python usage (hide PYTHONFSENCODING variable in Python help on Windows
    > and Mac OS X).

    This has to be changed: The env var needs to be respected on all
    platforms.

    I don't think so.

    On Mac OS X, you cannot create a file with an invalid utf-8 name. The VFS uses
    utf-8:
    http://developer.apple.com/mac/library/qa/qa2001/qa1173.html

    Using a different encoding will raise an error for the first non-ascii filename.
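
A short sketch of why a mismatched encoding fails on the first non-ascii name when the filesystem only accepts valid utf-8. It works on byte strings only and does not touch a real filesystem. The helper `is_valid_utf8` and the filenames are made up for the example.

```python
def is_valid_utf8(raw):
    """Return True if the byte string decodes strictly as utf-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Bytes that a program configured for latin-1 would pass to the OS.
latin1_name = "\u00e9t\u00e9.txt".encode("latin-1")
ascii_name = "notes.txt".encode("latin-1")

print(is_valid_utf8(ascii_name))   # pure ascii is valid in both encodings
print(is_valid_utf8(latin1_name))  # the non-ascii name is not valid utf-8
```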

    --

    About Windows: Python 3 uses the wide character API of Windows, except in some
    functions that rely on third-party libraries providing only a bytes API (e.g.
    openssl). Filenames are stored as unicode, even on removable media like CD-ROMs
    or USB keys. I don't get the use case here. Why would you like to change the
    filesystem encoding on Windows?

    @malemburg
    Member Author

    STINNER Victor wrote:
    > 
    > STINNER Victor <victor.stinner@haypocalc.com> added the comment:
    > 
    >>> Oh, I realized that PYTHONFSENCODING is ignored on Windows and Mac OS X.
    >>> r84201 and r84202 fix test_sys, and r84203 fixes the documentation and
    >>> Python usage (hide PYTHONFSENCODING variable in Python help on Windows
    >>> and Mac OS X).
    >>
    >> This has to be changed: The env var needs to be respected on all
    >> platforms.
    > 
    > I don't think so.
    > 
    > On Mac OS X, you cannot create a file with an invalid utf-8 name. The VFS uses 
    > utf-8:
    > http://developer.apple.com/mac/library/qa/qa2001/qa1173.html
    > 
    > Using a different encoding will raise an error for the first non-ascii filename.
    > 
    > --
    > 
    > About Windows, Python3 uses the wide character API of Windows, except in some 
    > functions using third party libraries only providing a bytes API (eg. 
    > openssl). filenames are stored as unicode, even on removable media like CD-Rom 
    > or USB keys. I don't get the usecase here. Why would you like to change the 
    > filesystem encoding on Windows?

    Ok, point taken.

    Just please make sure that on other platforms such as BSD, Solaris,
    AIX, etc. that don't have this special Python support
    the env vars are honored.

    @vstinner
    Member

    On Thursday 19 August 2010 at 22:40:53, you wrote:

    Just please make sure that on other platforms such as BSD, Solaris,
    AIX, etc. that don't have this special Python support
    the env vars are honored.

    I added many more tests on the filesystem encoding:

    • (test_os) FSEncodingTests.test_encodings() tests different encoding values
      and checks for some known values
    • (test_sys) SysModuleTest.test_pythonfsencoding() runs Python with the C locale
      and checks that the FS encoding is ascii, then checks that setting
      PYTHONFSENCODING is understood by Python (it runs python with "import sys;
      print(sys.getfilesystemencoding())" and compares the output)

    These tests are skipped on Windows and Mac OS X. I also patched the doc
    (what's new / cmdline) to explain that PYTHONFSENCODING is not available
    (ignored) on these OSes.
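
The test approach described above (start a fresh interpreter with a controlled environment and read back its filesystem encoding) can be sketched as follows. The helper name `child_fs_encoding` is hypothetical and this is not the test_sys code. The printed values depend on the platform and Python version, and the interpreter you run this with may not recognize PYTHONFSENCODING at all, so no specific result is assumed.

```python
import os
import subprocess
import sys


def child_fs_encoding(extra_env):
    """Run a fresh interpreter and return its sys.getfilesystemencoding()."""
    env = dict(os.environ)
    env.update(extra_env)
    out = subprocess.check_output(
        [sys.executable, "-c",
         "import sys; print(sys.getfilesystemencoding())"],
        env=env,
    )
    return out.decode("ascii").strip()


# Compare the child's answer with and without the override.
print(child_fs_encoding({"LC_ALL": "C"}))
print(child_fs_encoding({"LC_ALL": "C", "PYTHONFSENCODING": "latin-1"}))
```

Running the check in a child process matters because the encoding is fixed at startup, so it cannot be re-evaluated inside the test process itself.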

    @florentx
    Mannequin

    florentx mannequin commented Aug 22, 2010

    This is still an issue on some buildbots:

    • since r84224 on OS X (PPC Leopard, x86 Tiger)
    • since r84182 on sparc solaris10 gcc, x86 FreeBSD, x86 FreeBSD 7.2

    The issue was fixed in r84201, r84202, r84203 for OS X buildbots only, but since r84224 it is failing again.

    @florentx florentx mannequin reopened this Aug 22, 2010
    @florentx florentx mannequin added the type-bug An unexpected behavior, bug, or error label Aug 22, 2010
    @vstinner
    Member

    I'm working on a fix for test_sys failure. test_os should not fail anymore.

    @bitdancer
    Member

    In an up to date checkout of py3k on Gentoo linux with LC_CTYPE=en_US.UTF-8, I get a failure in test_sys:

    ======================================================================
    FAIL: test_pythonfsencoding (test.test_sys.SysModuleTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/rdmurray/python/py3k/Lib/test/test_sys.py", line 605, in test_pythonfsencoding
        self.check_fsencoding(get_fsencoding(env), 'ascii')
      File "/home/rdmurray/python/py3k/Lib/test/test_sys.py", line 573, in check_fsencoding
        self.assertEqual(fs_encoding, expected)
    AssertionError: 'utf-8' != 'ascii'
    - utf-8
    + ascii

    @bitdancer
    Member

    Setting LC_ALL instead of LANG in the test fixes the problem.
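
The reason LC_ALL works where LANG does not is the POSIX precedence of the locale variables: LC_ALL overrides LC_CTYPE, which overrides LANG. A minimal model of that lookup is shown below. The function name `effective_ctype` is made up, and the final "C" default stands in for the implementation-defined default locale.

```python
def effective_ctype(env):
    """Return the locale name that applies to LC_CTYPE for a given env dict.

    Models the POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:  # unset or empty variables are skipped
            return value
    return "C"


# LANG=C loses against an inherited LC_CTYPE, as in the failing test.
print(effective_ctype({"LANG": "C", "LC_CTYPE": "en_US.UTF-8"}))
# LC_ALL=C wins over everything else.
print(effective_ctype({"LC_ALL": "C", "LC_CTYPE": "en_US.UTF-8"}))
```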

    @vstinner
    Member

    r84308 should fix the last problems on Mac OS X, FreeBSD and Solaris.

    The last failure on test_sys is on Windows with test_undecodable_code (TypeError: Type str doesn't support the buffer API), which is unrelated.

    Reopen the issue if you see new failures.

    @bitdancer
    Member

    test_sys is still failing on my system where LC_CTYPE only is set to utf-8. Victor, do you want me to apply the LANG->LC_ALL change to the test?

    @bitdancer bitdancer reopened this Aug 25, 2010
    @vstinner
    Member

    test_sys is still failing on my system where LC_CTYPE
    only is set to utf-8

    Oh yes, test_sys fails if LC_ALL or LC_CTYPE is a locale using a different encoding than ascii (e.g. LC_ALL=fr_FR.utf8). Fixed by r84314.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022