Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make encoding="locale" uses locale encoding even in UTF-8 mode is enabled. #91156

Closed
methane opened this issue Mar 13, 2022 · 16 comments · Fixed by #70056
Closed

Make encoding="locale" uses locale encoding even in UTF-8 mode is enabled. #91156

methane opened this issue Mar 13, 2022 · 16 comments · Fixed by #70056
Labels
3.11 only security fixes topic-unicode

Comments

@methane
Copy link
Member

methane commented Mar 13, 2022

BPO 47000
Nosy @malemburg, @vstinner, @ezio-melotti, @methane, @eryksun
PRs
  • bpo-47000: Make io.text_encoding() respects UTF-8 mode #32003
  • bpo-47000: Add locale.getencoding() #32068
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2022-03-13.06:09:52.037>
    labels = ['expert-unicode', '3.11']
    title = 'Make encoding="locale" uses locale encoding even in UTF-8 mode is enabled.'
    updated_at = <Date 2022-04-04.02:47:05.537>
    user = 'https://github.com/methane'

    bugs.python.org fields:

    activity = <Date 2022-04-04.02:47:05.537>
    actor = 'methane'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Unicode']
    creation = <Date 2022-03-13.06:09:52.037>
    creator = 'methane'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 47000
    keywords = ['patch']
    message_count = 16.0
    messages = ['415028', '415118', '415146', '415147', '415219', '415240', '415767', '415768', '415769', '415851', '415867', '415922', '416329', '416330', '416332', '416649']
    nosy_count = 5.0
    nosy_names = ['lemburg', 'vstinner', 'ezio.melotti', 'methane', 'eryksun']
    pr_nums = ['32003', '32068']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue47000'
    versions = ['Python 3.11']

    @methane
    Copy link
    Member Author

    methane commented Mar 13, 2022

    Currently, encoding="locale" is just shortcut of encoding=locale.getpreferredencoding(False).

    encoding="locale" means that "locale encoding should be used here, even if Python default encoding is changed to UTF-8".

    I am not sure that UTF-8 mode becomes the default or not.
    But some user want to use UTF-8 mode to change default encoding in their Python environments without waiting Python default encoding changed.

    So I think encoding="locale" should use real locale encoding (ACP on Windows) regardless UTF-8 mode is enabled or not.

    Currently, UTF-8 mode affects to _Py_GetLocaleEncoding(). So it is difficult that make encoding="locale" ignores UTF-8 mode.
    Is it safe to use locale.getlocale(locale.LC_CTYPE)[1] or "UTF-8"?

    @methane methane added 3.11 only security fixes topic-unicode labels Mar 13, 2022
    @methane
    Copy link
    Member Author

    methane commented Mar 14, 2022

    I created a related topic on discuss.python.org.
    https://discuss.python.org/t/jep-400-utf-8-by-default-and-future-of-python/14246

    If we recommend PYTHONUTF8 as opt-in "UTF-8 by default", encoding="locale" should locale encoding in UTF-8 mode.

    If we don't change PYTHONUTF8 behavior, we need yet another option for opt-in "UTF-8 by default".

    @vstinner
    Copy link
    Member

    There are multiple "locale encodings":

    • "current" locale encoding: locale.nl_langinfo(locale.CODESET)
    • "Python" locale encoding: locale.getpreferredencoding(False), ignore the locale in UTF-8 Mode (always return "UTF-8"), ignore the locale on Android and VxWorks (always return "UTF-8")
    • Python "filesystem" encoding: similar to the Python locale encoding, but always use UTF-8 on Android, macOS and VxWorks

    Include/pyport.h:
    ---

    #if defined(__ANDROID__) || defined(__VXWORKS__)
       // Use UTF-8 as the locale encoding, ignore the LC_CTYPE locale.
       // See _Py_GetLocaleEncoding(), PyUnicode_DecodeLocale()
       // and PyUnicode_EncodeLocale().
    #  define _Py_FORCE_UTF8_LOCALE
    #endif
    
    #if defined(_Py_FORCE_UTF8_LOCALE) || defined(__APPLE__)
       // Use UTF-8 as the filesystem encoding.
       // See PyUnicode_DecodeFSDefaultAndSize(), PyUnicode_EncodeFSDefault(),
       // Py_DecodeLocale() and Py_EncodeLocale().
    #  define _Py_FORCE_UTF8_FS_ENCODING
    #endif

    See bpo-43552 "Add locale.get_locale_encoding() and locale.get_current_locale_encoding()" (rejected).

    Marc-Andre Lemburg dislikes locale.getpreferredencoding(False) API and suggested adding a new function locale.getencoding() with no argument:
    https://bugs.python.org/issue46659#msg412667

    @vstinner
    Copy link
    Member

    So I think encoding="locale" should use real locale encoding (ACP on Windows) regardless UTF-8 mode is enabled or not.

    If you want to change the default, would it be possible to add a function to get this encoding?

    @methane
    Copy link
    Member Author

    methane commented Mar 15, 2022

    I created another topic relating this issue.
    https://discuss.python.org/t/add-legacy-text-encoding-option-to-make-utf-8-default/14281

    If we add another option (e.g. legacy_text_encoding), we do not need to change UTF-8 mode behavior.

    @malemburg
    Copy link
    Member

    FWIW: I don't think the "locale" encoding is a good idea. Instead of
    trying to fix this to make it more usable, I'd suggest to deprecate
    and remove it again.

    When it comes to encodings, explicit is better than implicit.

    If an application wants to work with some user defined locale settings,
    it's better for the application to decide where to pick the locale
    settings from, e.g. the OS, the UI, an application config setting,
    etc.

    There are too many ways this can be done and trying to build
    magic to determine the "right" one is bound to fail in one way or
    another.

    @vstinner
    Copy link
    Member

    Is it safe to use locale.getlocale(locale.LC_CTYPE)[1] or "UTF-8"?

    I would like to deprecate getlocale(), see bpo-43557.

    @vstinner
    Copy link
    Member

    But some user want to use UTF-8 mode to change default encoding in their Python environments without waiting Python default encoding changed.

    IMO it's a different use case and it should be a different thing. Changing encoding="locale" today is too late, since it's already shipped in Python 3.10 (PEP-597).

    I proposed the "current locale" name to distinguish it from the existing "locale":

    • "current locale": LC_CTYPE locale encoding or ANSI code page
    • "locale": "UTF-8" in UTF-8 Mode, or the current locale

    The unclear part to me is if "current locale" must change if the LC_CTYPE locale is changed, or if it should be read once at startup and then never change.

    There *are* use case to really read the *current* LC_CTYPE locale encoding. There is already C API for that:

    • PyUnicode_EncodeLocale()
    • PyUnicode_DecodeLocale(), PyUnicode_DecodeLocaleAndSize()

    See also the "current_locale" parameter of the private API _Py_EncodeLocaleEx() and _Py_DecodeLocaleEx().

    @vstinner
    Copy link
    Member

    I propose:

    None of these functions do locale.setlocale(locale.LC_CTYPE, "") to get the user preferred encoding.

    Only the locale.getpreferredencoding() function uses locale.setlocale(locale.LC_CTYPE, "").

    Usage of locale.getpreferredencoding() should be discouraged in the documentation, but I don't think that it can be deprecated and scheduled for removal right now: too much code rely on it :-(

    ---

    So we have 3 encodings:

    • Python filesystem encoding
    • Locale encoding
    • Current locale encoding

    Examples of usage:

    • Python filesystem encoding:

      • os.fsdecode() / os.fsencode()
      • C: PyUnicode_EncodeFSDefault() / PyUnicode_DecodeFSDefault()
    • Locale encoding

      • _locale._get_locale_encoding()
      • On Unix, os.device_encoding()
      • To initialize PyConfig.stdio_encoding and PyConfig.filesystem_encoding
    • Current locale encoding

      • PyUnicode_EncodeLocale() / PyUnicode_DecodeLocale()
      • "current_locale" parameter of private _Py_EncodeLocaleEx() / _Py_DecodeLocaleEx()

    @methane
    Copy link
    Member Author

    methane commented Mar 23, 2022

    • sys.getfilesystemencoding(): Python filesystem encoding, return "UTF-8" if the Python UTF-8 Mode is enabled

    Yes, althoguh PYTHONLEGACYWINDOWSFSENCODING takes priority.

    • locale.getencoding(): Get the locale encoding, LC_CTYPE locale encoding or the Windows ANSI code page, *read at Python startup*. Ignore the Python UTF-8 Mode.

    I proposed locale.get_encoding() in the PEP-686. I will remove underscore if you don't like it.

    • locale.getencoding(current=True): Get the *current* locale encoding. The difference with locale.getencoding() is that on Unix, it gets the LC_CTYPE locale encoding at each call.

    Hmm, I don't add it to the PEP-686 because it is not relating to UTF-8 mode nor EncodingWarning.

    Since locale.getencoding() returns locale encoding on startup, how about this idea?

    • sys.getlocaleencoding() -- Get the locale encoding read at Python startup.
    • locale.getencoding() -- Get the current locale encoding.

    Note that we have sys.getdefaultencoding() and sys.getfilesystemencoding(). sys.getlocaleencoding() looks consistent with them.

    @vstinner
    Copy link
    Member

    sys.getlocaleencoding() versus locale.getencoding().

    For me, the Python locale module should use the C API to access the Unix locales like LC_CTYPE, nl_langinfo(CODESET), etc.

    The sys module are more for things specific to Python, like sys.getfilesystemencoding().

    Since sys.getlocaleencoding() would be a fixed value for the whole process life time, I agree that the sys module is a better place.

    I can write a PR adding sys.getlocaleencoding() if we agree on the API.

    @methane
    Copy link
    Member Author

    methane commented Mar 24, 2022

    I am not sure about we really need "locale encoding at Python startup".

    For this issue, I don't want to change encoding="locale" behavior except ignore UTF-8 mode. So what I want is "current locale encoding" or
    ANSI codepage on Windows.

    On the other hand, I know Eryk wants to support locale on Windows. So locale.get_encoding() might return current locale encoding (not ANSI codepage) even on Windows.
    If so, I will use sys.getlocaleencoding() to implement encoding="locale" to keep using ANSI codepage, instead of adding yet another "get locale encoding" function.

    @methane
    Copy link
    Member Author

    methane commented Mar 30, 2022

    @vstiner Since UTF-8 mode affects locale.getpreferredencoding(False), I need to decide alternative API in the PEP-686.

    If no objections, I will choose locale.get_encoding() for current locale encoding (ACP on Windows).

    See https://github.com/python/peps/pull/2470/files

    @malemburg
    Copy link
    Member

    Please see https://bugs.python.org/issue47000#msg415769 for what Victor
    suggested.

    In particular, the locale module uses the "no underscore" convention.
    Not sure whether it's good to start using snake case now, but I'm also
    not against it.

    I would like to reiterate my concern with the "locale" encoding, though.

    As mentioned earlier, I believe it adds too much magic. It would be better
    to leave this in the hands of the applications and not try to guess
    the correct encoding.

    It's better to expose easy to use APIs to access the various different
    settings and point users to those rather than try to do a best effort
    guess... explicit is better than implicit.

    After all, Mojibake potentially corrupts important data, without the
    alerting the user and that's not really what we should be after (e.g.
    UTF-8 is valid Latin-1 in most cases and this is a real problem we often
    run into in Germany with our Umlauts).

    @methane
    Copy link
    Member Author

    methane commented Mar 30, 2022

    Please see https://bugs.python.org/issue47000#msg415769 for what Victor
    suggested.

    Of course, I read it.

    In particular, the locale module uses the "no underscore" convention.
    Not sure whether it's good to start using snake case now, but I'm also
    not against it.

    Victor didn't mention about "no underscore" convention.
    I just want to see preference from others. I will remove the underscore.

    I would like to reiterate my concern with the "locale" encoding, though.

    As mentioned earlier, I believe it adds too much magic. It would be better
    to leave this in the hands of the applications and not try to guess
    the correct encoding.

    I don't recommend to use "locale" encoding for users.
    I strongly recommend to consider using "utf-8" instead.
    But "locale" encoding is needed when user don't want to change behavior of current application.
    It had been accepted by PEP-597 already.

    It's better to expose easy to use APIs to access the various different
    settings and point users to those rather than try to do a best effort
    guess... explicit is better than implicit.

    In some case, user need to decide "not change the encoding for now".
    If we don't provide "locale", it's difficult to change the default encoding to UTF-8.

    After all, Mojibake potentially corrupts important data, without the
    alerting the user and that's not really what we should be after (e.g.
    UTF-8 is valid Latin-1 in most cases and this is a real problem we often
    run into in Germany with our Umlauts).

    Changing the default encoding will temporary increase this risk.
    But after changing the default encoding to UTF-8, this risk will be reduced overwhelmingly.
    Most popular text editors, including VSCode, Atom, Sublime Text, Notepad.exe use UTF-8 by default.

    @methane
    Copy link
    Member Author

    methane commented Apr 4, 2022

    New changeset 4216dce by Inada Naoki in branch 'main':
    bpo-47000: Make io.text_encoding() respects UTF-8 mode (GH-32003)
    4216dce

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 only security fixes topic-unicode
    Projects
    None yet
    Development

    Successfully merging a pull request may close this issue.

    3 participants