Lower case file system encoding #48463

tiran · 2008-10-27T13:29:14Z

BPO	4213
Nosy	@malemburg, @loewis, @warsaw, @vstinner, @tiran
Files	get_codeset.patch: Use lookup(codeset).name as charset

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/warsaw'
closed_at = <Date 2008-10-30.21:40:25.807>
created_at = <Date 2008-10-27.13:29:14.027>
labels = ['interpreter-core', 'type-bug', 'release-blocker']
title = 'Lower case file system encoding'
updated_at = <Date 2008-10-30.22:38:38.348>
user = 'https://github.com/tiran'

bugs.python.org fields:

activity = <Date 2008-10-30.22:38:38.348>
actor = 'vstinner'
assignee = 'barry'
closed = True
closed_date = <Date 2008-10-30.21:40:25.807>
closer = 'christian.heimes'
components = ['Interpreter Core']
creation = <Date 2008-10-27.13:29:14.027>
creator = 'christian.heimes'
dependencies = []
files = ['11896']
hgrepos = []
issue_num = 4213
keywords = ['patch']
message_count = 10.0
messages = ['75252', '75253', '75254', '75255', '75256', '75257', '75276', '75384', '75387', '75392']
nosy_count = 5.0
nosy_names = ['lemburg', 'loewis', 'barry', 'vstinner', 'christian.heimes']
pr_nums = []
priority = 'release blocker'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue4213'
versions = ['Python 2.6', 'Python 3.0', 'Python 2.7']

tiran · 2008-10-27T13:29:12Z

Python should lower-case the file system encoding in Python/pythonrun.c.
In several places Python optimizes code paths for lower-case encoding
names such as "utf-8" or "latin-1". On my Ubuntu system the file system
encoding is upper case ("UTF-8"), so the optimizations aren't used. This
also causes problems with sub-interpreters (bpo-3723): initstdio() in the
sub-interpreter fails because "UTF-8" must be looked up in the codecs and
encoding registry, while "utf-8" works like a charm.

$ python2.6 -c "import sys; print sys.getfilesystemencoding()"
UTF-8

$ python3.0 -c "import sys; print(sys.getfilesystemencoding())"
UTF-8

$ locale
LANG=de_DE.UTF-8
LANGUAGE=en_US:en:de_DE:de
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=

The patch is trivial:

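	if (codeset) {
		if (!Py_FileSystemDefaultEncoding) {
			char *p;
			/* Lower-case the name in place. The cast avoids undefined
			   behavior when tolower() receives a negative char value. */
			for (p = codeset; *p; p++)
				*p = tolower((unsigned char)*p);
			Py_FileSystemDefaultEncoding = codeset;
		}
		else
			free(codeset);
	}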

Python/codecs.c:normalizestring() does a similar job. Maybe a new function
"char* PyCodec_NormalizeEncodingName(const char*)" could be introduced
for this purpose.

vstinner · 2008-10-27T14:14:39Z

Converting to lower case doesn't solve the problem: if the locale
reports "utf8" and Python checks for "utf-8", the optimization is still
missed. Another example: iso-8859-1, latin-1 or latin1?

A correct patch would get the most common name of the charset
and make sure that the Python C code always uses this name.
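To illustrate the objection, here is a minimal Python sketch. The helper `lowercase_only` is hypothetical and only mimics what a plain lower-casing step would do; it is not code from the patch.

```python
def lowercase_only(name: str) -> str:
    """Hypothetical stand-in for a plain lower-casing step."""
    return name.lower()


# Three spellings a locale might plausibly report for the same charset.
spellings = ["UTF-8", "utf8", "UTF8"]

# Lower-casing still leaves two different strings, so a comparison
# against the literal "utf-8" would miss the "utf8" spelling.
lowered = {lowercase_only(s) for s in spellings}
print(sorted(lowered))
```

The same applies to "iso-8859-1", "latin-1" and "latin1": all three are already lower case, yet they remain three different strings.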

malemburg · 2008-10-27T14:30:37Z

The lower-casing doesn't hurt, since that's done anyway during codec
lookup, but I'd be -1 on making this try to duplicate the aliasing
already done by the encodings package.

vstinner · 2008-10-27T14:35:16Z

Here is a patch to get the "most common charset name": use
codecs.lookup(codeset).name.
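As a rough illustration of the idea at the Python level (the actual patch does this in C during startup), `codecs.lookup()` returns a `CodecInfo` object whose `name` attribute is the codec's own name, regardless of the spelling passed in. The helper name `canonical_codec_name` is made up for this sketch.

```python
import codecs


def canonical_codec_name(codeset: str) -> str:
    """Return the name the codec registry uses for this charset."""
    return codecs.lookup(codeset).name


# Different spellings of the same charset map to one registry name.
for spelling in ["UTF-8", "utf8", "latin-1", "latin1", "iso-8859-1"]:
    print(spelling, "->", canonical_codec_name(spelling))
```

An unknown charset name raises `LookupError`, so a caller that cannot rely on the codec being available would need to handle that case.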

tiran · 2008-10-27T15:03:07Z

Victor's patch fixes the issue with bpo-3723.

malemburg · 2008-10-27T15:05:33Z

+1 on adding Victor's patch.

tiran · 2008-10-28T11:15:49Z

Me, too! The solution is elegant and works well.

Barry still has to accept the patch, though.

tiran · 2008-10-30T21:40:26Z

Fixed in r67055

loewis · 2008-10-30T22:14:20Z

> The solution is elegant and works well.

I can't agree with that evaluation. In cases where Python would fail
without this patch (i.e. because the file system encoding cannot be
found during startup), this solution doesn't work well in general - it
only works if the file system encoding happens to be UTF-8. If the file
system encoding is not in the list of "builtin" codec names, startup
would still fail.

r67057 addresses this case in a somewhat more general manner, by falling
back to ASCII during startup, for encoding file names. This should work
in the usual case where Python is in /usr/bin (say), but it's still
possible to make it fail, e.g. if the codecs are in /home/Питон (say),
on a system that uses koi8-r as the file system encoding, this bug would
persist despite the two patches that have been applied.
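The failure case described here can be sketched in Python. This is only an illustration of why an ASCII fallback covers some install paths and not others; the helper `encodable_as_ascii` is hypothetical and does not reproduce what r67057 does internally.

```python
def encodable_as_ascii(path: str) -> bool:
    """Return True if the path survives an ASCII-only encoding step."""
    try:
        path.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# A typical install location contains only ASCII characters.
print(encodable_as_ascii("/usr/bin"))

# A Cyrillic home directory needs the real file system encoding
# (koi8-r in the example above); ASCII cannot represent it.
print(encodable_as_ascii("/home/Питон"))
```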

vstinner · 2008-10-30T22:38:38Z

On Thursday 30 October 2008 23:14:21, Martin v. Löwis wrote:

> I can't agree with that evaluation. In cases where Python would fail
> without this patch (i.e. because the file system encoding cannot be
> found during startup),

My patch doesn't change how Python gets the file system encoding: it
just gets the "Python charset name" (e.g. "utf-8" instead of "UTF8",
or "iso8859-1" instead of "latin-1"). The goal was to enable the
optimizations, especially with utf-8. It's not related to bpo-3723.

tiran added the release-blocker label Oct 27, 2008

tiran assigned warsaw Oct 27, 2008

tiran added the interpreter-core and type-bug labels Oct 27, 2008

tiran closed this as completed Oct 30, 2008

ezio-melotti transferred this issue from another repository Apr 10, 2022
