Lower case file system encoding #48463

Closed
tiran opened this issue Oct 27, 2008 · 10 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) release-blocker type-bug An unexpected behavior, bug, or error

Comments

@tiran
Member

tiran commented Oct 27, 2008

BPO 4213
Nosy @malemburg, @loewis, @warsaw, @vstinner, @tiran
Files
  • get_codeset.patch: Use lookup(codeset).name as charset
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/warsaw'
    closed_at = <Date 2008-10-30.21:40:25.807>
    created_at = <Date 2008-10-27.13:29:14.027>
    labels = ['interpreter-core', 'type-bug', 'release-blocker']
    title = 'Lower case file system encoding'
    updated_at = <Date 2008-10-30.22:38:38.348>
    user = 'https://github.com/tiran'

    bugs.python.org fields:

    activity = <Date 2008-10-30.22:38:38.348>
    actor = 'vstinner'
    assignee = 'barry'
    closed = True
    closed_date = <Date 2008-10-30.21:40:25.807>
    closer = 'christian.heimes'
    components = ['Interpreter Core']
    creation = <Date 2008-10-27.13:29:14.027>
    creator = 'christian.heimes'
    dependencies = []
    files = ['11896']
    hgrepos = []
    issue_num = 4213
    keywords = ['patch']
    message_count = 10.0
    messages = ['75252', '75253', '75254', '75255', '75256', '75257', '75276', '75384', '75387', '75392']
    nosy_count = 5.0
    nosy_names = ['lemburg', 'loewis', 'barry', 'vstinner', 'christian.heimes']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue4213'
    versions = ['Python 2.6', 'Python 3.0', 'Python 2.7']


    Python should lowercase the file system encoding in Python/pythonrun.c.
    On several occasions Python optimizes code paths for lowercase
    encodings like "utf-8" or "latin-1". On my Ubuntu system the file system
    encoding is upper case ("UTF-8") and the optimizations aren't used. This
    also causes problems with sub interpreters (bpo-3723): initstdio() in the
    sub interpreter fails because "UTF-8" must be looked up in the codecs and
    encoding registry, while "utf-8" works like a charm.

    $ python2.6 -c "import sys; print sys.getfilesystemencoding()"
    UTF-8
    
    $ python3.0 -c "import sys; print(sys.getfilesystemencoding())"
    UTF-8
    
    $ locale
    LANG=de_DE.UTF-8
    LANGUAGE=en_US:en:de_DE:de
    LC_CTYPE="de_DE.UTF-8"
    LC_NUMERIC="de_DE.UTF-8"
    LC_TIME="de_DE.UTF-8"
    LC_COLLATE="de_DE.UTF-8"
    LC_MONETARY="de_DE.UTF-8"
    LC_MESSAGES="de_DE.UTF-8"
    LC_PAPER="de_DE.UTF-8"
    LC_NAME="de_DE.UTF-8"
    LC_ADDRESS="de_DE.UTF-8"
    LC_TELEPHONE="de_DE.UTF-8"
    LC_MEASUREMENT="de_DE.UTF-8"
    LC_IDENTIFICATION="de_DE.UTF-8"
    LC_ALL=

    The patch is trivial:

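    	if (codeset) {
    		if (!Py_FileSystemDefaultEncoding) {
    			char *p;
    			/* lower-case in place; the cast avoids undefined behavior for negative char values */
    			for (p = codeset; *p; p++)
    				*p = tolower((unsigned char)*p);
    			Py_FileSystemDefaultEncoding = codeset;
    		}
    		else
    			free(codeset);
    	}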

    Python/codecs.c:normalizestring() does a similar job. Maybe a new method
    "char* PyCodec_NormalizeEncodingName(const char*)" could be introduced
    for the problem.
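
To illustrate why the exact spelling of the encoding name matters, here is a minimal Python sketch of a fast path keyed on a literal name comparison. The function `encode_filename` and the set `FAST_PATH_NAMES` are hypothetical names invented for this illustration; they only mimic the pattern described above and are not CPython's actual C code.

```python
import codecs

# Hypothetical set of names a fast path might compare against (assumption).
FAST_PATH_NAMES = {"utf-8", "latin-1"}


def encode_filename(name, encoding):
    """Return (encoded bytes, whether the fast path was taken)."""
    if encoding in FAST_PATH_NAMES:
        # Fast path: call the encoder directly, no registry lookup.
        if encoding == "utf-8":
            return codecs.utf_8_encode(name)[0], True
        return codecs.latin_1_encode(name)[0], True
    # Slow path: resolve the name through the codec registry first.
    return codecs.lookup(encoding).encode(name)[0], False


print(encode_filename("spam.txt", "utf-8"))
print(encode_filename("spam.txt", "UTF-8"))
```

Both calls produce the same bytes, but only the lowercase spelling takes the shortcut in this sketch, which is the situation reported for the "UTF-8" locale name.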

    @tiran tiran added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels Oct 27, 2008
    @vstinner
    Member

    Converting to lower case doesn't solve the problem: if the locale
    is "utf8" and Python checks for "utf-8", the optimization will fail.
    Another example: iso-8859-1, latin-1 or latin1?

    A correct patch would be to get the most common name of the charset
    and make sure that Python C code always uses this name.
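
The point about aliases can be checked with a short Python sketch. The lists of spellings and the helper `lowered` are assumptions made up for this example: lower-casing alone still leaves several distinct names for the same charset.

```python
# Spellings that commonly refer to the same charset (illustrative lists).
utf8_spellings = ["UTF-8", "utf8", "UTF_8"]
latin1_spellings = ["ISO-8859-1", "latin-1", "latin1"]


def lowered(names):
    """Return the distinct names that remain after lower-casing only."""
    return sorted({name.lower() for name in names})


print(lowered(utf8_spellings))
print(lowered(latin1_spellings))
```

Each list still contains three different strings after lower-casing, so a literal comparison against "utf-8" or "latin-1" would miss the other two.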

    @malemburg
    Member

    The lower-casing doesn't hurt, since that's done anyway during codec
    lookup, but I'd be -1 on making this try to duplicate the aliasing
    already done by the encodings package.

    @vstinner
    Member

    Here is a patch to get the "most common charset name": use
    codecs.lookup(codeset).name.
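
At the Python level, the idea can be sketched as follows. The helper `canonical_name` is a hypothetical wrapper written for this example; the actual patch does the equivalent lookup from C during startup, and its details are not reproduced here.

```python
import codecs


def canonical_name(codeset):
    """Return the name Python's codec registry reports for codeset."""
    return codecs.lookup(codeset).name


for spelling in ("UTF-8", "utf8", "latin-1", "ISO-8859-1"):
    print(spelling, "->", canonical_name(spelling))
```

Because the registry resolves aliases through the encodings package, every spelling of one charset maps to a single name, without duplicating the alias table elsewhere.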

    @tiran
    Member Author

    tiran commented Oct 27, 2008

    Victor's patch fixes the issue with bpo-3723.

    @malemburg
    Member

    +1 on adding Victor's patch.

    @tiran
    Member Author

    tiran commented Oct 28, 2008

    Me, too! The solution is elegant and works well.

    Barry still has to accept the patch, though.

    @tiran
    Member Author

    tiran commented Oct 30, 2008

    Fixed in r67055

    @tiran tiran closed this as completed Oct 30, 2008
    @loewis
    Mannequin

    loewis mannequin commented Oct 30, 2008

    > The solution is elegant and works well.

    I can't agree with that evaluation. In cases where Python would fail
    without this patch (i.e. because the file system encoding cannot be
    found during startup), this solution doesn't work well in general - it
    only works if the file system encoding happens to be UTF-8. If the file
    system encoding is not in the list of "builtin" codec names, startup
    would still fail.

    r67057 addresses this case in a somewhat more general manner, by falling
    back to ASCII during startup, for encoding file names. This should work
    in the usual case where Python is in /usr/bin (say), but it's still
    possible to make it fail: e.g. if the codecs are in /home/Питон (say)
    on a system that uses koi8-r as the file system encoding, this bug would
    persist despite the two patches that have been applied.
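
The limitation of an ASCII fallback can be illustrated with a small Python sketch. The helper `can_encode` is hypothetical and only tests whether a path is representable in a given encoding; it is not the startup code changed in r67057.

```python
def can_encode(path, encoding):
    """Return True if path can be encoded with the given encoding."""
    try:
        path.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(can_encode("/usr/bin", "ascii"))
print(can_encode("/home/Питон", "ascii"))
print(can_encode("/home/Питон", "koi8-r"))
```

A plain ASCII path survives the fallback, while the Cyrillic path needs the real file system encoding, which is exactly the codec that may not be loadable yet at that point.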

    @vstinner
    Member

    On Thursday 30 October 2008 23:14:21, Martin v. Löwis wrote:

    > I can't agree with that evaluation. In cases where Python would fail
    > without this patch (i.e. because the file system encoding cannot be
    > found during startup),

    My patch doesn't change how Python gets the file system encoding: it
    just gets the "Python charset name" (e.g. "utf-8" instead of "UTF8",
    or "iso8859-1" instead of "latin-1"). The goal was to enable the
    optimizations, especially with utf-8. It's not related to bpo-3723.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022