Add locale.get_locale_encoding() and locale.get_current_locale_encoding() #87718

Closed
vstinner opened this issue Mar 19, 2021 · 32 comments
Labels
3.10 (only security fixes), stdlib (Python modules in the Lib dir)

Comments

@vstinner
Member

BPO 43552
Nosy @malemburg, @vstinner, @methane, @eryksun
PRs
  • bpo-43552: Add locale.get_current_locale_encoding() #24931
Files
  • encodings.py
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2021-06-29.00:19:02.728>
    created_at = <Date 2021-03-19.09:17:13.541>
    labels = ['library', '3.10']
    title = 'Add locale.get_locale_encoding() and locale.get_current_locale_encoding()'
    updated_at = <Date 2021-06-29.00:19:02.727>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2021-06-29.00:19:02.727>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-06-29.00:19:02.728>
    closer = 'vstinner'
    components = ['Library (Lib)']
    creation = <Date 2021-03-19.09:17:13.541>
    creator = 'vstinner'
    dependencies = []
    files = ['49894']
    hgrepos = []
    issue_num = 43552
    keywords = ['patch']
    message_count = 32.0
    messages = ['389057', '389062', '389063', '389064', '389065', '389066', '389068', '389069', '389070', '389072', '389074', '389076', '389079', '389080', '389082', '389083', '389087', '389088', '389089', '389090', '389091', '389093', '389098', '389100', '389101', '389102', '389118', '389131', '389136', '389159', '389195', '396686']
    nosy_count = 4.0
    nosy_names = ['lemburg', 'vstinner', 'methane', 'eryksun']
    pr_nums = ['24931']
    priority = 'normal'
    resolution = 'rejected'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue43552'
    versions = ['Python 3.10']

    @vstinner
    Member Author

    I propose to add two new functions:

    • locale.get_locale_encoding(): exactly the same as locale.getpreferredencoding(False).

    • locale.get_current_locale_encoding(): always get the current locale encoding. Read the ANSI code page on Windows, or nl_langinfo(CODESET) on other platforms. Ignore the UTF-8 Mode. Don't always return "UTF-8" on macOS, Android, VxWorks.

    Technically, locale.get_locale_encoding() would simply expose _locale.get_locale_encoding() that I added recently. It calls the new private _Py_GetLocaleEncoding() function (which has no argument).

    By the way, Python requires nl_langinfo(CODESET) to be built. It's not a new requirement of Python 3.10, but I wanted to note that, I noticed it when I implemented _locale.get_locale_encoding() :-)

    Python has a bad habit of lying to the user: locale.getpreferredencoding(False) is *NOT* the current locale encoding in multiple cases.

    • locale.getpreferredencoding(False) always returns "UTF-8" on macOS, Android and VxWorks
    • locale.getpreferredencoding(False) always returns "UTF-8" if the UTF-8 Mode is enabled
    • otherwise, it returns the current locale encoding: the ANSI code page on Windows, or nl_langinfo(CODESET) on other platforms

    Even if locale.getpreferredencoding(False) already exists, I propose to add locale.get_locale_encoding() because I dislike the locale.getpreferredencoding() API. By default, this function temporarily sets LC_CTYPE to the user preferred locale. This can cause mojibake in other threads, since setlocale(LC_CTYPE, "") affects all threads :-( Calling locale.getpreferredencoding(), rather than locale.getpreferredencoding(False), is not what most people expect. This API can be misused.

    On the other hand, locale.get_locale_encoding() does exactly what it says: it only *gets* the encoding and doesn't temporarily *set* the locale to something else.

    By the way, the locale.localeconv() function can temporarily change the LC_CTYPE locale to the LC_MONETARY locale, which can cause other threads to use the wrong LC_CTYPE locale! But this is a different issue.
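
A minimal sketch of the distinction described above, using only existing stdlib calls. The helper names `current_locale_encoding()` and `describe_encodings()` are hypothetical and are not the proposed API; the Windows branch assumes that reading the process ANSI code page through `ctypes` is an acceptable stand-in for what the proposal describes.

```python
import locale
import sys


def current_locale_encoding():
    """Encoding of the current LC_CTYPE locale, ignoring the Python UTF-8 Mode.

    Hypothetical helper for illustration; not the function proposed in the issue.
    """
    if hasattr(locale, "nl_langinfo"):
        # POSIX: ask the C library directly. This does not call setlocale().
        return locale.nl_langinfo(locale.CODESET)
    if sys.platform == "win32":
        import ctypes
        # Windows: read the process ANSI code page, as the proposal describes.
        return "cp%d" % ctypes.windll.kernel32.GetACP()
    return None  # neither mechanism is available


def describe_encodings():
    """Collect the values that can disagree with each other."""
    return {
        "utf8_mode": bool(sys.flags.utf8_mode),
        "getpreferredencoding(False)": locale.getpreferredencoding(False),
        "current locale encoding": current_locale_encoding(),
    }


if __name__ == "__main__":
    for key, value in describe_encodings().items():
        print(f"{key}: {value!r}")
```

Running it once normally and once with `-X utf8` should show whether the two encodings diverge on a given system.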

    @vstinner vstinner added the 3.10 (only security fixes) and stdlib (Python modules in the Lib dir) labels Mar 19, 2021
    @malemburg
    Member

    On 19.03.2021 10:17, STINNER Victor wrote:

    New submission from STINNER Victor <vstinner@python.org>:

    I propose to add two new functions:

    • locale.get_locale_encoding(): exactly the same as locale.getpreferredencoding(False).

    • locale.get_current_locale_encoding(): always get the current locale encoding. Read the ANSI code page on Windows, or nl_langinfo(CODESET) on other platforms. Ignore the UTF-8 Mode. Don't always return "UTF-8" on macOS, Android, VxWorks.

    I'm not sure whether this would improve the situation much.

    The problem is that the locale module is meant to expose the lib C
    locale settings, but many of the recent additions actually do something
    completely different: they look into the process and user environment
    and try to determine external settings, which are not reflected in
    the lib C locale settings.

    I had added locale.getdefaultlocale() to give applications a chance
    to determine the locale setting defined by the process environment
    *without* calling setlocale(LC_ALL, '') and causing problems
    in other threads. I used the X11 database for locale encodings,
    which was the closest you could get to in terms of a standard for
    encodings at the time (around 2000).

    Part of the return value is the encoding, which would be set.

    Martin later added locale.getpreferredencoding(), which tries to
    determine the encoding in a different, new way, based on
    nl_langinfo(CODESET). As you mentioned, this intention was broken
    on several platforms by forcing UTF-8 as output. And in many cases,
    the API had to call setlocale() as well, causing the thread problems.

    However, the problem with nl_langinfo(CODESET) is the same as
    with setlocale(): it returns the current state of the lib C
    settings, which may well point to the 'C' locale. Not the ones
    the user has configured in the OS environment. So while you get
    an encoding defined by lib C for the current locale settings
    (without guessing it as with locale.getdefaultlocale()), you
    still don't get what the user really wants to use.

    Unfortunately, lib C does not provide a way to query the locale
    database without changing the locale settings at the same time.
    This is the main issue we're facing.

    Now, the correct way in all this would be to just call
    setlocale(LC_ALL, '') at the start of the application and
    not try to apply all the magic to get around this. But this
    has to be done by the application and not Python (which may
    well be embedded into some other application).

    I'd suggest to add a single new API:

    locale.getencoding()

    which interfaces to nl_langinfo(CODESET) or the Windows code
    page and does not try to do any magic, ie. does *not* call
    setlocale(). It needs to return what the lib C currently
    knows and uses as encoding.

    locale.getpreferredencoding() should then be deprecated.

    It does not make sense to pretend to query information which is
    not really directly available from the lib C locale system.

    And the documentation should point out that applications should
    call setlocale(LC_ALL, '') when they start up, if they want to
    get the lib C locale, and thus Python locale module, setup to
    work based on what the user really wants -- instead of just
    guessing at this.

    PS: The locale module normally does not use underscores in
    function names, so it's not a good idea to add more.
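
The startup pattern suggested above can be sketched as follows. This is only an illustration under stated assumptions: the function name `init_locale_from_environment()` is hypothetical, and falling back to the unchanged locale when the environment names a locale that is not installed is one possible policy, not something the thread prescribes.

```python
import locale


def init_locale_from_environment():
    """Adopt the user's environment locale once, at application startup.

    Hypothetical helper; returns the resulting LC_ALL setting string.
    """
    try:
        # An empty string asks the C library to read LC_ALL / LC_* / LANG.
        return locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment names a locale that is not installed:
        # keep whatever is currently set and just report it.
        return locale.setlocale(locale.LC_ALL)


if __name__ == "__main__":
    print("LC_ALL is now:", init_locale_from_environment())
    if hasattr(locale, "nl_langinfo"):
        # After startup, querying needs no further setlocale() call.
        print("C library encoding:", locale.nl_langinfo(locale.CODESET))
```

Because setlocale() affects the whole process, a call like this belongs in the application's entry point, before other threads start.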

    @malemburg malemburg changed the title Add locale.get_locale_encoding() and locale.get_current_locale_encoding() Add locale.get_locale_encoding() and locale.get_current_locale_encoding() Mar 19, 2021
    @vstinner
    Member Author

    I created this issue while reviewing the implementation of the PEP-597: PR 19481.

    Copy of my comments on the PR related to this issue.

    _locale.get_locale_encoding() calls _Py_GetLocaleEncoding() which returns UTF-8 if the Python UTF-8 Mode is enabled.

    Maybe the function could have a flag: please don't lie to me and return the current locale encoding ;-)

    Or we could add a function to get the *current* locale encoding: **locale.get_current_locale_encoding()**.

    This one would ignore the UTF-8 Mode and call nl_langinfo(CODESET). There are APIs to use the *current* locale encoding: PyUnicode_EncodeLocale/PyUnicode_DecodeLocale and _Py_EncodeLocaleEx/_Py_DecodeLocaleEx with current_locale=1. You can see which functions use it:

    • decode tm_zone field of localtime_r() and gmtime()
    • decode tzname[0] and tzname[1] strings
    • decode setlocale() result
    • decode some localeconv() fields (this function requires to switch to different locale encoding, it's bad!)
    • decode nl_langinfo() result
    • decode gettext(), dgettext(), dcgettext(), textdomain(), bindtextdomain(), bind_textdomain_codeset() result
    • decode strerror() and dlerror() result
    • encode/decode in the readline module
    • encode format string for strftime() in time.strftime() (only used on Windows, Unix provides wcsftime) and then decode strftime() result

    encoding="locale": uses the locale encoding regardless of the UTF-8 Mode.

    Currently, open(encoding=None) doesn't work like that. For example, on macOS, Android and VxWorks, it always uses UTF-8. And if the UTF-8 Mode is enabled, UTF-8 is used.

    In PEP-597, I read that encoding="locale" is the same as encoding=None but doesn't emit an EncodingWarning. Where does PEP-597 change the chosen encoding for the encoding=None case? The PEP says "locale encoding" without specifying exactly what it is. In Python, it means different things depending on the context. There is a subtle difference between the **current** locale encoding and "the locale encoding". I agree that it needs some clarification :-)

    While we discuss encodings, I never understood why open() gets the current locale encoding from nl_langinfo(CODESET), an encoding which can change at runtime while Python is running. For example, if thread A calls open(filename, encoding=None), thread B calls locale.localeconv(), and the LC_MONETARY locale uses a different encoding than the LC_CTYPE locale, thread A can get the LC_MONETARY encoding because of how locale.localeconv() is currently implemented: it temporarily changes LC_CTYPE to LC_MONETARY to decode the monetary fields of the localeconv() result.

    I would prefer that Python uses the same encoding for the whole lifetime of the process, from the beginning until the end. The Python filesystem encoding is a good choice for that. It's the same as locale.getpreferredencoding(False) (currently used by open() and friends), but becomes different if LC_CTYPE is changed (temporarily or permanently).
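
A small sketch of the point made above: the filesystem encoding is fixed for the lifetime of the process, while the value reported by nl_langinfo(CODESET) follows the current LC_CTYPE locale. The helper names are hypothetical, and the sketch switches to the "C" locale only because that locale is always available; whether the codeset actually differs depends on the locale the process started with.

```python
import locale
import sys


def snapshot():
    """Return (filesystem encoding, current LC_CTYPE codeset or None)."""
    codeset = None
    if hasattr(locale, "nl_langinfo"):
        codeset = locale.nl_langinfo(locale.CODESET)
    return sys.getfilesystemencoding(), codeset


def snapshot_around_locale_change(new_locale="C"):
    """Take a snapshot before and during a temporary LC_CTYPE change."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, no change
    before = snapshot()
    try:
        locale.setlocale(locale.LC_CTYPE, new_locale)
        during = snapshot()
    finally:
        # Restore the original locale; setlocale() is process-wide.
        locale.setlocale(locale.LC_CTYPE, saved)
    return before, during


if __name__ == "__main__":
    before, during = snapshot_around_locale_change()
    print("before:", before)
    print("during:", during)
```

The first element of both tuples stays the same, whereas the second one may change with the locale.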

    @vstinner
    Member Author

    I created PR 24931 to add locale.get_current_locale_encoding(). I tried to clarify the differences between the "current locale encoding" and the "locale encoding".

    Maybe we should rename the "locale encoding" to the "Python locale encoding", since it's not what most Unix developers would expect. What do you think?

    While most locale functions have no underscore in their name, it seems like the current trend is to allow underscores in names for *new* functions. For example, the sys module has functions without underscores:

    • sys.getallocatedblocks()
    • sys.getdefaultencoding()
    • sys.getfilesystemencodeerrors()
    • ...

    But it got new functions with underscores:

    • sys.set_asyncgen_hooks()
    • sys.set_coroutine_origin_tracking_depth()

    ... and there are some old functions with underscores:

    • sys.exc_info()
    • sys.call_tracing()
    • sys._clear_type_cache()
    • sys._current_frames()

    In the locale module, there is one existing function with an underscore:

    • locale.format_string()

    @vstinner
    Member Author

    Now, the correct way in all this would be to just call setlocale(LC_ALL, '') at the start of the application

    Python now does that during its initialization on all platforms. So getpreferredencoding(False) is what its documentation says: the user preferred encoding, the LC_CTYPE locale encoding.

    On Python 3.7, _Py_SetLocaleFromEnv(LC_CTYPE) was called in _Py_InitializeCore() on Unix, but not on Windows.

    Since Python 3.8, _PyPreConfig_Write() calls _Py_SetLocaleFromEnv(LC_CTYPE) on all platforms including Windows. See bpo-34485 and my article for more details ("C locale on Windows" section):
    https://vstinner.github.io/python3-locales-encodings.html

    _Py_SetLocaleFromEnv(LC_CTYPE) calls setlocale(LC_CTYPE, ""), but has more complex code on Android.

    @vstinner
    Member Author

    locale.getencoding()

    which interfaces to nl_langinfo(CODESET) or the Windows code
    page and does not try to do any magic, ie. does *not* call
    setlocale(). It needs to return what the lib C currently
    knows and uses as encoding.

    This is locale.get_current_locale_encoding(). I would like to put "current" in the name, because there is a lot of confusion between get_current_locale_encoding() encoding and locale.getpreferredencoding(False) encoding. In locale.getpreferredencoding(False), Python ignores the locale in some cases which is counter intuitive.

    I propose to add new functions to reduce confusion and better document the subtle differences between the different "locale encodings".

    That's also why I propose to rename the "locale encoding" to the "Python locale encoding" in the documentation: to clarify that Python sometimes ignores the locale.

    The PEP-538 (coerce the C locale) and PEP-540 (Python UTF-8 Mode) introduced confusion.

    @malemburg
    Member

    On 19.03.2021 11:36, STINNER Victor wrote:

    STINNER Victor <vstinner@python.org> added the comment:

    > locale.getencoding()
    >
    > which interfaces to nl_langinfo(CODESET) or the Windows code
    > page and does not try to do any magic, ie. does *not* call
    > setlocale(). It needs to return what the lib C currently
    > knows and uses as encoding.

    This is locale.get_current_locale_encoding(). I would like to put "current" in the name, because there is a lot of confusion between get_current_locale_encoding() encoding and locale.getpreferredencoding(False) encoding. In locale.getpreferredencoding(False), Python ignores the locale in some cases which is counter intuitive.

    These attempts have resulted in much of the confusion around the locale
    module. It's better not to create more of it.

    • "locale" in the name is unnecessary, since this is the locale module.

    • If you add "current", people will rightly ask: then what do all the
      other APIs in the locale module return ? Of course, they all return
      the current state of settings :-) So this is unnecessary as well.

    locale.getencoding() works in the same way as locale.getlocale().
    It interfaces to the lib C and returns the current encoding setting
    as known by the lib C. It's just a more intuitive name than
    locale.nl_langinfo(CODESET) and works on Windows as well.

    And, again, locale.getpreferredencoding() should be deprecated.
    The API has been misused in too many ways and is completely broken
    by now. It was a good idea at the time, when Martin added it,
    even though I never liked the name.

    @vstinner
    Member Author

    Attached encodings.py lists the different "locale encodings" used by Python. Example:
    ---

    $ LANG=fr_FR ./python -X utf8 encodings.py fr_FR@euro
    Set LC_CTYPE to 'fr_FR@euro'

    LC_ALL env var: ''
    LC_CTYPE env var: ''
    LANG env var: 'fr_FR'
    LC_CTYPE locale: 'fr_FR@euro'
    Coerce C locale: 0
    Python UTF-8 Mode: 1

    (1) Python FS encoding
    sys.getfilesystemencoding(): 'utf-8'

    (2) Python locale encoding
    _locale._get_locale_encoding(): 'UTF-8'
    locale.getpreferredencoding(False): 'UTF-8'

    (3) Current locale encoding
    locale.get_current_locale_encoding(): 'ISO-8859-15'

    (4) And more encodings for more fun!
    locale.getdefaultlocale()[1]: 'ISO8859-1'
    locale.getpreferredencoding(True): 'UTF-8'
    ---

    Python starts with the LC_CTYPE locale set to fr_FR (ISO8859-1), then the script sets the LC_CTYPE locale to fr_FR@euro (ISO-8859-15). The Python UTF-8 Mode is enabled explicitly. We get a funny combination of no fewer than 3 encodings!

    • UTF-8
    • ISO-8859-1
    • ISO-8859-15

    Which one is the correct one? Well... It depends :-)

    (1) The Python filesystem encoding is used to call almost all operating system functions: encode to the OS and decode from the OS. Filenames, environment variables, command line options, etc.

    (2) The "Python" locale encoding is used by open() when no encoding is specified.

    (3) The current locale encoding is used by a limited number of functions that I listed in msg389063. Most users should not use it.

    (4) locale.getpreferredencoding(True) is a weird beast. It matches the Python locale encoding until setlocale(LC_CTYPE, locale) is called for the first time. But it can still be the same afterwards if the Python UTF-8 Mode is enabled. I'm not sure in which category we should put this function :-(

    (4 bis) locale.getdefaultlocale()[1] is the only function returning the ISO-8859-1 encoding. This encoding is not used by any function. I'm not sure of the purpose of this function. It sounds confusing.

    I suggest to deprecate locale.getpreferredencoding(True).

    I'm not sure what to do with locale.getdefaultlocale(). Should we deprecate it? I never used this function. How is it used? For which purpose?

    I understand that in 2000, locale.getdefaultlocale() was interesting to avoid calling setlocale(LC_CTYPE, ""). But Python 3 has called setlocale(LC_CTYPE, "") by default at startup since the early versions, and it's been called on all platforms since Python 3.8. Moreover, its internal database seems to be outdated and is painful to maintain (especially if we consider all platforms supported by Python, not only Linux; there are many issues on macOS).

    @vstinner
    Member Author

    Martin later added locale.getpreferredencoding(), which tries to
    determine the encoding in a different, new way, based on
    nl_langinfo(CODESET). As you mentioned, this intention was broken
    on several platforms by forcing UTF-8 as output.

    When I designed and implemented the PEP-540 (Python UTF-8 Mode), I tried to leave getpreferredencoding() unchanged. The problem was that I quickly got mojibake because too many functions call getpreferredencoding(False):

    • open() and _pyio.open() -- in Python 3.10, open() now calls the C _Py_GetLocaleEncoding() function to fix issues during Python shutdown, it also avoids issues at startup.
    • Many gettext functions
    • cgi to decode the query string from the QUERY_STRING env var or sys.argv[1]
    • xml.etree.ElementTree.write(encoding="unicode") in some cases

    The Python UTF-8 Mode ignores the locale *on purpose*. But I agree that it's surprising and can lead to confusion. That's what I'm trying to fix here :-)

    @vstinner
    Member Author

    @eryksun
    Contributor

    eryksun commented Mar 19, 2021

    Read the ANSI code page on Windows,

    I don't see why the Windows implementation is inconsistent with POSIX here. If it were changed to be consistent, the default encoding at startup would remain the same, since setlocale(LC_CTYPE, "") uses the process code page from GetACP(). In many if not most cases, no one would be the wiser. But it seems to me that if a script calls setlocale(LC_CTYPE, "el_GR"), then it clearly wants to encode Greek text (code page 1253). open() with encoding passed as None or "locale" should respect this. Similarly if it calls setlocale(LC_CTYPE, ".UTF-8"), then it wants the default locale (language/region), but with UTF-8 encoding.

    The following is a snippet to get the current locale encoding with ucrt in Windows:

        #include <locale.h>
        #include <windows.h>  /* GetACP() */
    
        int cp = 0;
        __crt_locale_data_public *locale_data;
    
        /* Take a snapshot of the calling thread's current CRT locale. */
        _locale_t locale = _get_current_locale();
        if (locale) {
            locale_data = (__crt_locale_data_public *)locale->locinfo;
            cp = locale_data->_locale_lc_codepage;
            _free_locale(locale);
        }
    
        if (cp == 0) {
            /* "C" locale. The CRT in effect uses Latin-1 (cp28591), but
               Windows Python prefers the process code page. */
            cp = GetACP();
        }

    With ucrt, the C runtime was changed to hide most of the locale definition that was previously public, but it intentionally defines __crt_locale_data_public, so I'm assuming it's there for programs to use. That said, the fact that we have to cast locinfo seems suspect to me. Steve Dower could maybe check with the ucrt devs to ensure that this is supported.

    There's also ___lc_codepage() to get the same value more simply, and also more efficiently since the current locale data doesn't have to be copied and freed. However, it's documented as internal and could be removed (unlikely as that is).

    @malemburg
    Member

    On 19.03.2021 12:05, STINNER Victor wrote:

    I'm not sure what to do with locale.getdefaultlocale(). Should we deprecate it? I never used this function. How is it used? For which purpose?

    I understand that in 2000, locale.getdefaultlocale() was interesting to avoid calling setlocale(LC_CTYPE, ""). But Python 3 has called setlocale(LC_CTYPE, "") by default at startup since the early versions, and it's been called on all platforms since Python 3.8. Moreover, its internal database seems to be outdated and is painful to maintain (especially if we consider all platforms supported by Python, not only Linux; there are many issues on macOS).

    Yes, deprecate it as well. If Python calls setlocale() per default now,
    it has served its purpose.

    The alias database is needed by the normalization engine. We may be
    able to drop the encoding part, but this would have to be checked.

    @malemburg
    Member

    On 19.03.2021 12:26, STINNER Victor wrote:

    STINNER Victor <vstinner@python.org> added the comment:

    Recently, I spent some days to document properly encodings used by Python.

    Thanks for documenting this.

    I would prefer to leave the locale module to really just an interface
    to the lib C locale logic and not add encoding details which are
    specific to Python's view on I/O (sys or io) or the file system (os).

    Hopefully, in a few years, we can get rid of all this and standardize
    on UTF-8 everywhere.

    @malemburg
    Member

    On 19.03.2021 12:35, Eryk Sun wrote:

    Eryk Sun <eryksun@gmail.com> added the comment:

    > Read the ANSI code page on Windows,

    I don't see why the Windows implementation is inconsistent with POSIX here. If it were changed to be consistent, the default encoding at startup would remain the same, since setlocale(LC_CTYPE, "") uses the process code page from GetACP().

    I'm not sure I understand what you're saying (but then, I have little
    experience with locales on Windows).

    My assumption is that nl_langinfo(CODESET) does not work on Windows
    or gives wrong results. Is that incorrect ?

    If it does work, getencoding() could just be a shim over
    nl_langinfo(CODESET) on all platforms.

    @eryksun
    Contributor

    eryksun commented Mar 19, 2021

    If Python calls setlocale() per default now, it has served its purpose.

    Except not for embedding applications if configure_locale [1] isn't set. But in that case determining the default locale isn't Python's problem to solve.

    My assumption is that nl_langinfo(CODESET) does not work on Windows
    or gives wrong results. Is that incorrect ?

    There is no such function for CRT locales. I provided two alternatives that would allow implementing this consistent with POSIX, and thus avoid all of the "except on Windows..." disclaimers that have to explain (apologize) that only the process ANSI code page is used in Windows, and, for no good reason as far as I can tell, the LC_CTYPE locale encoding is completely ignored.

    ---

    [1] https://docs.python.org/3/c-api/init_config.html#c.PyPreConfig.configure_locale

    @malemburg
    Member

    On 19.03.2021 13:25, Eryk Sun wrote:

    > My assumption is that nl_langinfo(CODESET) does not work on Windows
    > or gives wrong results. Is that incorrect ?

    There is no such function for CRT locales. I provided two alternatives that would allow implementing this consistent with POSIX, and thus avoid all of the "except on Windows..." disclaimers that have to explain (apologize) that only the process ANSI code page is used in Windows, and, for no good reason as far as I can tell, the LC_CTYPE locale encoding is completely ignored.

    Sounds good. If we can get consistent behavior on Windows as well,
    all the better :-)

    @vstinner
    Member Author

    I created bpo-43557 "Deprecate getdefaultlocale(), getlocale() and normalize() functions". Let's discuss deprecating getdefaultlocale() there.

    @vstinner
    Member Author

    • If you add "current", people will rightly ask: then what do all the
      other APIs in the locale module return ? Of course, they all return
      the current state of settings :-) So this is unnecessary as well.

    The problem is that there are two different "locale encodings", what I call:

    • "current locale encoding": nl_langinfo(CODESET) in short
    • "Python locale encoding": "UTF-8" in some cases, nl_langinfo(CODESET) otherwise

    It is unfortunate that the Python UTF-8 Mode which "ignores the locale" changes the behavior of the locale module, of the locale.getpreferredencoding() function. But the ship has sailed.

    People are used to look into the "locale" module to get the "locale" encoding. So I prefer to put the function to get the "Python locale encoding" in the locale module.

    I propose to add "current" in the name since this encoding is not the one you are looking for usually.

    An alternative is to have a single function with an optional parameter. Example:

    • get_locale_encoding() or get_locale_encoding(True) returns the locale encoding
    • get_locale_encoding(False) returns the current locale encoding
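
The relationship between the two "locale encodings" named above can be modelled as a pure function. This is a sketch of the rules as stated in this thread, not CPython's implementation; the function name and the platform labels are illustrative assumptions (they are not `sys.platform` values).

```python
# Platforms where, per this thread, the Python locale encoding is always UTF-8.
# Illustrative labels only; they are not sys.platform values.
UTF8_ONLY_PLATFORMS = {"macos", "android", "vxworks"}


def python_locale_encoding(current_locale_encoding, *, utf8_mode=False, platform="linux"):
    """Model the "Python locale encoding" from the "current locale encoding"."""
    if utf8_mode:
        # The Python UTF-8 Mode ignores the locale on purpose.
        return "UTF-8"
    if platform in UTF8_ONLY_PLATFORMS:
        return "UTF-8"
    # Otherwise both notions coincide: nl_langinfo(CODESET) or the ANSI code page.
    return current_locale_encoding


if __name__ == "__main__":
    # Mirrors the fr_FR@euro example earlier in the thread.
    print(python_locale_encoding("ISO-8859-15", utf8_mode=True))
    print(python_locale_encoding("ISO-8859-15"))
```

Keeping the two notions as separate inputs and outputs makes it explicit that only the first argument comes from the C library.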

    @methane
    Member

    methane commented Mar 19, 2021

    I created this issue while reviewing the implementation of the PEP-597: PR 19481.

    What I want is the same as locale.getpreferredencoding(False), but ignoring UTF-8 mode.

    Background: PEP-597 adds a new encoding="locale" option to open() and TextIOWrapper(). It is the same as encoding=None for now, but it means using the "locale encoding" explicitly.

    But this is wrong in UTF-8 mode.

    In UTF-8 mode, it's fine for open(filename) to use UTF-8. But I want open(filename, encoding="locale") to use the "locale encoding", because the "locale" encoding is specified explicitly.

    I don't want to add a new meaning here. It should be the same as locale.getpreferredencoding(False) without UTF-8 mode. So I need "cp%d" % GetACP() on Windows, not the CRT locale encoding.

    I don't care about its name. Both sys.locale_encoding() and locale.get_encoding() are OK.

    @vstinner
    Member Author

    In UTF-8 mode, it's fine for open(filename) to use UTF-8. But I want open(filename, encoding="locale") to use the "locale encoding", because the "locale" encoding is specified explicitly.

    Is it about the current implementation of PEP-597, or are you thinking of a future Python which would use UTF-8 by default?

    Currently, getpreferredencoding(False) respects the behavior that you described, no?

    @methane
    Member

    methane commented Mar 19, 2021

    Is it about the current implementation of PEP-597, or are you thinking of a future Python which would use UTF-8 by default?

    I forgot to consider UTF-8 mode while finishing PEP-597. If possible, I want to ignore UTF-8 mode when encoding="locale" is specified, starting with Python 3.10.
    Otherwise, the behavior will change between Python 3.10 and 3.11.

    Currently, getpreferredencoding(False) respects the behavior that you described, no?

    getpreferredencoding(False) respects UTF-8 mode. That's what PEP-597 said (because the PEP doesn't define the behavior in UTF-8 mode) and what #63680 implements.

    But it is not what I want now. I want to ignore UTF-8 mode when encoding="locale" is specified.

    This is almost a Windows-only issue, and users can use encoding="mbcs" in Windows-only scripts.

    But encoding="locale" is the new and recommended way to explicitly request the "locale" encoding. When a user specifies the "locale" encoding explicitly, I think we should respect it regardless of UTF-8 mode.
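
A minimal usage sketch of the encoding="locale" option discussed above, assuming Python 3.10 or later (older versions reject "locale" as an unknown encoding). The helper name `roundtrip()` is hypothetical. Which concrete encoding is selected, and whether UTF-8 mode affects it, is exactly the open question in this thread, so the sketch only reports what the file object says it used.

```python
import os
import sys
import tempfile


def roundtrip(text, encoding):
    """Write and re-read text with the given encoding; return (encoding used, text)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
            used = f.encoding  # the encoding the text layer resolved to
        with open(path, encoding=encoding) as f:
            return used, f.read()
    finally:
        os.unlink(path)


if __name__ == "__main__":
    if sys.version_info >= (3, 10):
        # ASCII-only text, so the round trip works under any locale encoding.
        print(roundtrip("hello", "locale"))
    else:
        print('encoding="locale" requires Python 3.10 or later')
```

Comparing the reported encoding with and without `-X utf8` shows how a given Python version resolves the option.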

    @vstinner
    Member Author

    Hum, latest messages are specific to the PEP-597 (implementation).

    I forgot to consider UTF-8 mode while finishing PEP-597.

    I propose to continue the discussion about the PEP-597 in bpo-43510. I replied there.

    I prefer to keep this issue to discuss the locale module.

    @malemburg
    Member

    On 19.03.2021 14:47, STINNER Victor wrote:

    STINNER Victor <vstinner@python.org> added the comment:

    > - If you add "current", people will rightly ask: then what do all the
    > other APIs in the locale module return ? Of course, they all return
    > the current state of settings :-) So this is unnecessary as well.

    The problem is that there are two different "locale encodings", what I call:

    • "current locale encoding": nl_langinfo(CODESET) in short
    • "Python locale encoding": "UTF-8" in some cases, nl_langinfo(CODESET) otherwise

    The UTF-8 mode is a Python invention. It doesn't have anything to
    do with the lib C locale functions, which this module addresses and
    interfaces to.

    Please don't mix the two.

    In fact, in order to avoid issues, Python should probably set the locale
    encoding to UTF-8 as well, when run in UTF-8 mode. It's dangerous to
    have Python and the lib C use different assumptions about the encoding,
    esp. in embedded applications.

    It is unfortunate that the Python UTF-8 Mode which "ignores the locale" changes the behavior of the locale module, of the locale.getpreferredencoding() function. But the ship has sailed.

    People are used to look into the "locale" module to get the "locale" encoding. So I prefer to put the function to get the "Python locale encoding" in the locale module.

    I propose to add "current" in the name since this encoding is not the one you are looking for usually.

    An alternative is to have a single function with an optional parameter. Example:

    • get_locale_encoding() or get_locale_encoding(True) returns the locale encoding
    • get_locale_encoding(False) returns the current locale encoding

    -1, both on the names and the idea to again add parameters which change
    their meaning. We should have one function per meaning and really
    only need the interface getencoding(), since the UTF-8 mode
    doesn't fit into the locale module scope.

    @malemburg
    Member

    On 19.03.2021 14:57, Inada Naoki wrote:

    Background: PEP-597 adds a new encoding="locale" option to open() and TextIOWrapper(). It is the same as encoding=None for now, but it means using the "locale encoding" explicitly.

    But this is wrong in UTF-8 mode.

    Please address UTF-8 mode explicitly in open() or elsewhere. The locale
    module is about the state of the lib C, not what Python enforces via
    options in its own I/O layers.

    As mentioned, both should ideally be synchronized, though, so
    UTF-8 mode in Python should trigger setting a UTF-8 encoding
    via setlocale().

    @methane
    Member

    methane commented Mar 19, 2021

    Please address UTF-8 mode explicitly in open() or elsewhere. The locale
    module is about the state of the lib C, not what Python enforces via
    options in its own I/O layers.

    I agree with you. APIs in the locale module shouldn't be aware of UTF-8 mode.

    locale.getpreferredencoding() is special, because it "Return the encoding used for text data, according to user preferences. User preferences are expressed differently on different systems, and might not be available programmatically on some systems, so this function only returns a guess."

    As mentioned, both should ideally be synchronized, though, so
    UTF-8 mode in Python should trigger setting a UTF-8 encoding
    via setlocale().

    There is PEP-538 already :)

    @malemburg
    Member

    On 19.03.2021 16:15, Inada Naoki wrote:

    locale.getpreferredencoding() is special, because it "Return the encoding used for text data, according to user preferences. User preferences are expressed differently on different systems, and might not be available programmatically on some systems, so this function only returns a guess."

    I already wrote earlier that we should deprecate this API, since the
    overloading with different meanings in the past has turned it into
    an unreliable source of information. At this point, it returns
    "some encoding, which may or may not be what you want" :-)

    We need to get things separated out clearly again: the locale
    module is for the lib C locale state. What Python does in the
    I/O layers has to be defined and queried at the appropriate
    places elsewhere (e.g. os, sys or io modules).

    > As mentioned, both should ideally be synchronized, though, so
    > UTF-8 mode in Python should trigger setting a UTF-8 encoding
    > via setlocale().

    There is PEP-538 already :)

    Great :-)

    @eryksun
    Contributor

    eryksun commented Mar 19, 2021

    But it is not what I want for now. I want to ignore UTF-8 mode
    when encoding="locale" is specified.
    This is almost "only in Windows" issue, and users can use
    encoding="mbcs" in Windows-only script.

    Why is it being specified that the current LC_CTYPE encoding should be ignored in Windows when a "locale" encoding is requested? Cross-platform C code would use mbstowcs() and wcstombs(), with the current LC_CTYPE encoding. That's Latin-1 in the initial "C" locale and defaults to GetACP() if setlocale(LC_CTYPE, "") is called, but otherwise it's whatever locale is requested by the program and supported by the system (all Windows installations support pretty much every locale).

    @methane
    Member

    methane commented Mar 20, 2021

    Why is it being specified that the current LC_CTYPE encoding should be ignored in Windows when a "locale" encoding is requested?

    Because encoding="locale" must be a replacement for the current encoding=None (i.e. locale.getpreferredencoding(False)).

    encoding=None behavior will be changed if we change the default encoding or enable UTF-8 mode by default. So we are adding an explicit name to current behavior.

    So it is not an option to assign another encoding. See PEP-597 for details.
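
As an illustration of the option described above, here is a small sketch that writes and reads a file with `encoding="locale"`. The helper name and file name are hypothetical. It assumes Python 3.10 or later for the `"locale"` value and falls back to `encoding=None` on older versions, which is the behavior the new name is meant to spell out.

```python
import os
import sys
import tempfile


def roundtrip_with_locale_encoding(text):
    """Write text and read it back using the locale encoding (hypothetical helper)."""
    # encoding="locale" is only accepted by open() in Python 3.10+;
    # on older versions fall back to the implicit default.
    encoding = "locale" if sys.version_info >= (3, 10) else None
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.txt")
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        with open(path, encoding=encoding) as f:
            return f.read()


if __name__ == "__main__":
    print(roundtrip_with_locale_encoding("plain ASCII text"))
```

ASCII text is used on purpose: whether non-ASCII text survives depends on the encoding actually selected, which is what the rest of this thread debates.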

    I know you are proposing to use the CRT locale on Windows. If we change locale.getpreferredencoding(False) to use the CRT locale, encoding="locale" will follow it.
    But please discuss it in another issue.

    @eryksun
    Contributor

    eryksun commented Mar 20, 2021

    But please discuss it in another issue.

    What's returned by locale.get_locale_encoding() and locale.get_current_locale_encoding() is relevant to adding them as new functions and is a chance to implement this correctly in Windows.

    You're right that what open() does for encoding="locale" is a separate issue, with backwards compatibility problems. I think it was implemented badly and is needlessly inconsistent with POSIX. But we may be stuck with the behavior considering scripts are within their rights, per documented behavior, to expect that calling setlocale(LC_CTYPE, locale_name) in Windows has no effect on the result of locale.getpreferredencoding(False), unlike POSIX generally, except for some platforms such as macOS and Android.

    @vstinner
    Member Author

    Python has used GetACP(), the ANSI code page of the operating system, for years. What is the advantage of using a different encoding? In my experience, most applications use the ANSI code page because they use the ANSI flavor of the Windows API.

    What is the use case for using ___lc_codepage()? Is it a different encoding?

    @eryksun
    Contributor

    eryksun commented Mar 20, 2021

    In my experience, most applications use the ANSI code page because
    they use the ANSI flavor of the Windows API.

    The default encoding at startup and in the "C" locale wouldn't change. It would only differ from the default if setlocale(LC_CTYPE, locale_name) sets it otherwise. The suggestion is to match the behavior of nl_langinfo(CODESET) in Linux and many other POSIX systems.

    When I say the default encoding won't change, I mean that the Universal C Runtime (ucrt) system component uses the process ANSI code page as the default locale encoding for setlocale(LC_CTYPE, ""). This agrees with what Python has always done, but it disagrees with previous versions of the CRT in Windows. Personally, I think it's a misstep because the user locale isn't necessarily compatible with the process code page, but I'm not looking to change this decision. For example, if the user locale is "el_GR" (Greek, Greece) but the process code page is 1252 (Latin) instead of 1253 (Greek), I get the following result in Python 3.4 (VC++ 10) vs Python 3.5 (ucrt):

    >py -3.4 -c "from locale import *; print(setlocale(LC_CTYPE, ''))"
    Greek_Greece.1253
    
    >py -3.5 -c "from locale import *; print(setlocale(LC_CTYPE, ''))"
    Greek_Greece.1252
    

    The result from VC++ 10 is consistent with the user locale. It's also consistent with multilingual user interface (MUI) text, such as error messages, or at least it should be, because the user locale and user preferred language (i.e. Windows display language) should be consistent. (The control panel dialog to set the user locale in Windows 10 has an option to match the display language, which is the recommended and default setting.) For example, Python uses system error messages that are localized to the user's preferred language:

        >py -c "import os; os.stat('spam')"
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
        FileNotFoundError: [WinError 2] Δεν είναι δυνατή η εύρεση του καθορισμένου αρχείου από το σύστημα: 'spam'

    This example is on a system where the process (system) ANSI code page is 1252 (Latin), which cannot encode the user's preferred Greek text. Thankfully Python 3.6+ uses the console's Unicode API, so neither the console session's output code page nor the process code page gets in the way. On the other hand, if this Greek text is written to a file or piped to a child process using subprocess.Popen(), Python's choice of locale encoding based on the process code page (Latin) is incompatible with Greek text, and thus it's incompatible with the current user's preferred locale and language settings.
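
The incompatibility described above can be reproduced on any platform with Python's built-in codecs. This is only a sketch: the helper name `can_encode` is hypothetical, and the sample text is the start of the Greek error message shown earlier.

```python
def can_encode(text, encoding):
    """Return True if text is representable in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    message = "Δεν είναι δυνατή η εύρεση"
    # Code page 1252 (Latin) has no Greek letters; 1253 (Greek) does.
    print("cp1252:", can_encode(message, "cp1252"))
    print("cp1253:", can_encode(message, "cp1253"))
```

With a process code page of 1252, writing this message to a file opened with the default encoding would raise `UnicodeEncodeError` unless an error handler or another encoding is chosen.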

    The process ANSI code page from GetACP() has its uses, which are important. It's a system setting that's independent of the current user locale and thus useful when interacting with the legacy system API and as a common encoding for inter-process data exchange when applications do not use Unicode and may be operating in different locales. So if you're writing to a legacy-encoded text file that's shared by multiple users or piping text to an arbitrary program, then using the ANSI code page is probably okay. Though, especially for IPC, there's a good chance that it's wrong since Windows has never set, let alone enforced, a standard in that case.

    Using the process ANSI code page in the "C" locale makes sense to me.
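
For reference, a sketch of how the process ANSI code page can be queried from Python through `ctypes`. The helper name is hypothetical, the call is Windows-only, and on other platforms the function simply returns `None`. It has not been checked against how CPython obtains the value internally.

```python
import ctypes
import sys


def process_ansi_code_page():
    """Return the process ANSI code page as a codec name, or None off Windows."""
    if sys.platform != "win32":
        return None
    # GetACP() takes no arguments and returns the code page identifier.
    code_page = ctypes.windll.kernel32.GetACP()
    return f"cp{code_page}"


if __name__ == "__main__":
    print(process_ansi_code_page())
```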

    What is the use case for using ___lc_codepage()? Is it a different
    encoding?

    I always forget the "_func" suffix in the name; it's ___lc_codepage_func() [1]. The lc_codepage value is the current LC_CTYPE codeset as an integer code page. It's the equivalent of nl_langinfo(CODESET) in POSIX. For UTF-8, the code page is CP_UTF8 (65001), but this gets displayed in locale strings as "UTF-8" (or variants such as "utf8"). It could be the LC_CTYPE encoding of just the current thread, but Python does not enable per-thread locales.

    The CRT has exported ___lc_codepage_func() since VC++ 7.0 (2002), and before that the current lc_codepage value itself was directly exported as __lc_codepage. However, this triple-dundered function is documented as internal and not recommended for use. That's why the code snippet I showed uses _get_current_locale() with locinfo cast to __crt_locale_data_public *. This takes "public" in the struct name at face value. Anything that's declared public should be safe to use, but the locale_t type is frustratingly undocumented even for this public data [2].

    If neither approach is supported, locale.get_current_locale_encoding() could instead parse the current locale encoding from setlocale(LC_CTYPE, NULL). The resulting locale string usually includes the codeset (e.g. "Greek_Greece.1253"). The exceptions are the "C" locale and BCP-47 (RFC 5646) locales that do not explicitly use UTF-8 (e.g. "el_GR" or "el" instead of "el_GR.UTF-8"), but these cases can be handled reliably.
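
A rough sketch of the parsing fallback described above. The function name `codeset_from_locale_string` is hypothetical, and the sketch only extracts an explicit codeset. It returns `None` for the "C" locale and for names without a codeset, leaving the caller to pick a default for those cases.

```python
import locale


def codeset_from_locale_string(locale_string):
    """Extract the codeset from a locale string, or return None if absent."""
    # Drop an optional POSIX modifier such as "@euro".
    name = locale_string.partition("@")[0]
    _, dot, codeset = name.partition(".")
    if not dot or not codeset:
        # "C", "el_GR", "el": no explicit codeset in the string.
        return None
    if codeset.isdigit():
        # Windows-style numeric code page, e.g. "Greek_Greece.1253".
        return "cp" + codeset
    return codeset


if __name__ == "__main__":
    # Passing None queries the current LC_CTYPE locale without changing it.
    current = locale.setlocale(locale.LC_CTYPE, None)
    print(current, "->", codeset_from_locale_string(current))
```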

    ---

    [1] https://docs.microsoft.com/en-us/cpp/c-runtime-library/lc-codepage-func
    [2] https://docs.microsoft.com/en-us/cpp/c-runtime-library/locale

    @vstinner
    Member Author

    PEP-597 was implemented successfully in Python 3.10 with this feature.

    There is no agreement yet on what the "current locale encoding" is.

    For now, I prefer to close the issue.

    We can reconsider this feature once there are more user requests for such a function and once there is agreement on what the "current locale encoding" is.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022