Add a new codec: "locale", the current locale encoding #57828

Closed
vstinner opened this issue Dec 17, 2011 · 9 comments
Labels
stdlib Python modules in the Lib dir topic-unicode type-feature A feature request or enhancement

Comments

@vstinner
Member

BPO 13619
Nosy @malemburg, @loewis, @pitrou, @vstinner, @ezio-melotti
Files
  • locale_encoding-3.patch
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2012-02-10.22:34:39.021>
    created_at = <Date 2011-12-17.06:13:45.329>
    labels = ['type-feature', 'library', 'expert-unicode']
    title = 'Add a new codec: "locale", the current locale encoding'
    updated_at = <Date 2012-02-10.22:34:39.020>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2012-02-10.22:34:39.020>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-02-10.22:34:39.021>
    closer = 'vstinner'
    components = ['Library (Lib)', 'Unicode']
    creation = <Date 2011-12-17.06:13:45.329>
    creator = 'vstinner'
    dependencies = []
    files = ['24446']
    hgrepos = []
    issue_num = 13619
    keywords = ['patch']
    message_count = 9.0
    messages = ['149660', '149662', '149671', '149678', '149946', '150114', '150122', '152819', '153080']
    nosy_count = 5.0
    nosy_names = ['lemburg', 'loewis', 'pitrou', 'vstinner', 'ezio.melotti']
    pr_nums = []
    priority = 'normal'
    resolution = 'wont fix'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue13619'
    versions = ['Python 3.3']

    @vstinner
    Member Author

    To factor out shared code and fix encoding issues in the time module, I added functions to decode from and encode to the locale encoding: PyUnicode_DecodeLocale(), PyUnicode_DecodeLocaleAndSize() and PyUnicode_EncodeLocale() (issue bpo-13560). During testing, I realized that os.strerror() should also use the current locale encoding.

    Do you think that the codec should be exposed in Python?

    --

    The C functions are used by:

    • the locale module to decode result of locale functions
    • Py_Main() to decode the PYTHONWARNINGS environment variable (PyUnicode_DecodeFSDefault could be used here, but it would just call PyUnicode_DecodeLocale because the Python codec is not loaded yet, a funny bootstrap issue)
    • PyUnicode_EncodeFSDefault() and PyUnicode_DecodeFSDefaultAndSize before the locale encoding is known and the Python codec is fully ready
    • os.strerror() and PyErr_SetFromErrno*() to decode the error message
    • time.strftime() to encode the format and decode the result, both when the wcsftime() function is not available and on Windows. On Windows, wcsftime() is available but is avoided to work around an encoding issue in the timezone name (see issue bpo-10653)
    • time to decode time.tzname

    The codec can be useful for developers interacting with C functions that depend on the locale, for example strerror() and strftime(). Using the filesystem encoding would be wrong for such functions, because the locale encoding can be changed by setlocale() with LC_CTYPE or LC_ALL, so the filesystem encoding may no longer match it. The result would be mojibake.

    Even if the most common use cases of C functions depending on the locale are already covered by the Python standard library, developers may want to bind new functions using ctypes (or something else), and I believe that the locale encoding would be useful for these bindings.
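As a rough illustration of what such a binding needs today without a dedicated codec, here is a minimal sketch in pure Python. The helper names `decode_locale` and `encode_locale` are hypothetical (they are not part of the attached patch or of the standard library); they simply look up the current locale encoding and pass it to the ordinary `bytes.decode()` / `str.encode()` methods.

```python
import locale


def decode_locale(data: bytes, errors: str = "strict") -> str:
    """Decode bytes returned by a locale-dependent C function (hypothetical helper)."""
    # Passing False reads the current locale setting without calling setlocale().
    encoding = locale.getpreferredencoding(False)
    return data.decode(encoding, errors)


def encode_locale(text: str, errors: str = "strict") -> bytes:
    """Encode text before passing it to a locale-dependent C function (hypothetical helper)."""
    encoding = locale.getpreferredencoding(False)
    return text.encode(encoding, errors)


# Pure ASCII survives in any ASCII-compatible locale encoding.
message = decode_locale(b"No such file or directory")
payload = encode_locale("%Y-%m-%d")
```

Note that this only approximates the proposed codec: the encoding name is looked up from Python, whereas the C functions in the patch decode and encode through the C library itself.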

    --

    The problem with a new codec is that it becomes more difficult to choose the right encoding:

    • filesystem encoding: filenames, directory names, hostname, environment variables, command line arguments
    • mbcs (ANSI code page): (basically, it is just an alias of the filesystem encoding)
    • locale: write bindings for new C functions?

    I suppose that this issue can be solved by writing documentation explaining when to use each codec.
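To make the distinction concrete, the following sketch shows how the two encodings that already exist can be queried and used from Python. It uses only standard library calls; the file name `data.txt` is an arbitrary example value.

```python
import locale
import os
import sys

# Encoding used for filenames, environment variables and command line arguments.
fs_encoding = sys.getfilesystemencoding()

# Encoding of the current locale, which setlocale() may change at runtime.
locale_encoding = locale.getpreferredencoding(False)

# Filenames should go through os.fsencode()/os.fsdecode(), which use the
# filesystem encoding and its error handler, not the locale encoding.
name_bytes = os.fsencode("data.txt")
name_text = os.fsdecode(name_bytes)

print("filesystem encoding:", fs_encoding)
print("locale encoding:", locale_encoding)
```

The two values are often identical, which is part of why choosing between them is easy to get wrong.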

    --

    Attached patch adds the new locale codec.

    The major limitation of the current implementation is that the codec only supports the strict and the surrogateescape error handlers. I don't plan to implement other error handlers because I don't think that they would be useful, but it would be possible to implement them.
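For readers unfamiliar with the two error handlers, here is a small sketch of how they differ. It uses the built-in UTF-8 codec as a stand-in, since the proposed locale codec does not exist in released Python; the byte string is an arbitrary example.

```python
raw = b"caf\xe9"  # Latin-1 bytes for "café", which are invalid as UTF-8

# "strict" refuses undecodable bytes.
try:
    raw.decode("utf-8", "strict")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# "surrogateescape" maps each undecodable byte 0xNN to the lone surrogate
# U+DCNN, so the original bytes can be restored when encoding again.
text = raw.decode("utf-8", "surrogateescape")
roundtrip = text.encode("utf-8", "surrogateescape")
```

The lossless round trip is what makes surrogateescape suitable for data coming from the operating system, where the bytes must be preserved even if they cannot be decoded.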

    --

    It would be "nice" to fix os.strerror() and time.strftime() in Python 3.2, but I don't want to, because the fix would require adding the locale codec and I don't want to make such a change in a stable version. The issue only concerns the few people who change their locale encoding at runtime. I hope that everybody uses UTF-8 and never changes their locale encoding to something else ;-)

    @vstinner vstinner added the stdlib Python modules in the Lib dir label Dec 17, 2011
    @vstinner
    Member Author

    # On FreeBSD, Solaris and Mac OS X, b'\xff' can be decoded in
    # the C locale. The C locale is something like ISO-8859-1, not
    # 7-bit ASCII.

    On FreeBSD, it *is* the ISO-8859-1 encoding.
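The difference matters for the test because ISO-8859-1 can decode every possible byte, while 7-bit ASCII cannot. A quick sketch using Python's built-in codecs (not the C locale itself) shows the contrast:

```python
data = b"\xff"

# ISO-8859-1 maps each byte 0x00-0xFF directly to the code point U+0000-U+00FF.
latin1_text = data.decode("iso-8859-1")
all_bytes_text = bytes(range(256)).decode("iso-8859-1")

# 7-bit ASCII rejects any byte >= 0x80.
try:
    data.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```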

    @vstinner
    Member Author

    Patch version 2: improve the test. Also try the user locale encoding if the C locale uses ISO-8859-1 (this should improve code coverage on FreeBSD, Mac OS X and Solaris).

    @ezio-melotti ezio-melotti added topic-unicode type-feature A feature request or enhancement labels Dec 17, 2011
    @vstinner
    Member Author

    I tested locale_encoding-2.patch on Linux, FreeBSD and Windows: UTF-8 and ISO-8859-1 locales on Linux and FreeBSD, and the cp1252 ANSI code page on Windows.

    @vstinner
    Member Author

    It would be possible to implement an incremental decoder with mbsrtowcs() and an incremental encoder with wcsrtombs(), by serializing mbstate_t to a long integer (TextIOWrapper.tell() does something like that). The problem is that mbsrtowcs() and wcsrtombs() are "recent" (not always available). It may also be dangerous to allow the user to pass an arbitrary mbstate_t (using .setstate()).
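To show what the state of an incremental decoder looks like in practice, here is a sketch using the existing UTF-8 incremental decoder from the codecs module. It is only an analogy: a locale-based decoder would have to carry an mbstate_t instead of a buffer of pending bytes.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# Feed only the first byte of the two-byte sequence for "é" (0xC3 0xA9).
first = decoder.decode(b"\xc3")  # nothing can be emitted yet
state = decoder.getstate()       # the pending byte is part of the state

# A fresh decoder can resume from the saved state, as .setstate() allows.
resumed = codecs.getincrementaldecoder("utf-8")()
resumed.setstate(state)
second = resumed.decode(b"\xa9")
```

Because `setstate()` accepts whatever the caller passes in, a decoder whose state is an opaque C structure would have to validate it, which is the danger mentioned above.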

    @vstinner
    Member Author

    + encoding = locale.getpreferredencoding()

    It should be locale.getpreferredencoding(False).
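The difference between the two calls, as documented for the locale module, is the `do_setlocale` argument: with the default (`True`), the function may call setlocale() to obtain the user's preference, while `False` only inspects the current setting. A short sketch:

```python
import locale

# Reads the encoding of the locale currently in effect; does not call setlocale().
current = locale.getpreferredencoding(False)

# May call setlocale() internally to consult the user's preferences,
# so the result describes the user's environment, not necessarily the active locale.
user_preferred = locale.getpreferredencoding()

print(current, user_preferred)
```

The two values can differ when the program has not called setlocale(LC_CTYPE, "") itself, which is why a test of the current locale encoding should pass `False`.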

    @pitrou
    Member

    pitrou commented Dec 23, 2011

    I'm not sure I like this idea. I think it would be nice to see it discussed on python-dev.

    @vstinner
    Member Author

    vstinner commented Feb 7, 2012

    + encoding = locale.getpreferredencoding()

    It should be locale.getpreferredencoding(False).

    Fixed in patch version 3.

    @vstinner
    Member Author

    According to the discussion on the python-dev mailing list, such a codec would cause too much confusion for users, so it is better not to add it.
    http://mail.python.org/pipermail/python-dev/2012-February/116272.html

    I'm closing the issue as "wont fix".

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022