classification
Title: Lower case file system encoding
Type: behavior Stage:
Components: Interpreter Core Versions: Python 3.0, Python 2.7, Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: barry Nosy List: barry, christian.heimes, lemburg, loewis, vstinner
Priority: release blocker Keywords: patch

Created on 2008-10-27 13:29 by christian.heimes, last changed 2008-10-30 22:38 by vstinner. This issue is now closed.

Files
File name Uploaded Description
get_codeset.patch vstinner, 2008-10-27 14:35 Use lookup(codeset).name as charset
Messages (10)
msg75252 - Author: Christian Heimes (christian.heimes) (Python committer) Date: 2008-10-27 13:29
Python should lower case the file system encoding in Python/pythonrun.c.
On several occasions Python optimizes code paths for lower case
encodings like "utf-8" or "latin-1". On my Ubuntu system the file system
encoding is upper case ("UTF-8") and the optimizations aren't used. This
also causes problems with sub interpreters (#3723): initstdio() in the sub
interpreter fails because "UTF-8" must be looked up in the codecs and
encoding registry, while "utf-8" works like a charm.

$ python2.6 -c "import sys; print sys.getfilesystemencoding()"
UTF-8

$ python3.0 -c "import sys; print(sys.getfilesystemencoding())"
UTF-8

$ locale
LANG=de_DE.UTF-8
LANGUAGE=en_US:en:de_DE:de
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=

The patch is trivial:

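	if (codeset) {
		if (!Py_FileSystemDefaultEncoding) {
			char *p;
			/* Lower-case in place; the cast avoids undefined
			   behavior of tolower() on negative char values. */
			for (p = codeset; *p; p++)
				*p = tolower((unsigned char)*p);
			Py_FileSystemDefaultEncoding = codeset;
		}
		else
			free(codeset);
	}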

Python/codecs.c:normalizestring() does a similar job. Maybe a new method
"char* PyCodec_NormalizeEncodingName(const char*)" could be introduced
for the problem.
msg75253 - Author: STINNER Victor (vstinner) (Python committer) Date: 2008-10-27 14:14
Converting to the lower case doesn't solve the problem: if the locale 
is "utf8" and Python checks for "utf-8", the optimization will fail. 
Another example: iso-8859-1, latin-1 or latin1?

A correct patch would be to get the most common name of the charset
and make sure that Python C code always uses this name.
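
The point about spellings can be sketched in a few lines of Python. The `FAST_PATH_NAMES` tuple and the `takes_fast_path()` helper are hypothetical stand-ins for the exact-name comparisons in the C code, not actual interpreter internals; the sketch only shows that lower-casing alone leaves several spellings unmatched.

```python
# Hypothetical stand-in for the exact spellings that the C code special-cases.
FAST_PATH_NAMES = ("utf-8", "latin-1")


def takes_fast_path(encoding):
    """Return True if plain lower-casing is enough to hit an optimized path."""
    return encoding.lower() in FAST_PATH_NAMES


# "UTF-8" matches after lower-casing; the alias spellings do not.
for name in ("UTF-8", "utf8", "UTF8", "latin1", "ISO-8859-1"):
    print(name, takes_fast_path(name))
```

Only the first spelling benefits from lower-casing; the others need a real alias lookup.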
msg75254 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2008-10-27 14:30
The lower-casing doesn't hurt, since that's done anyway during codec
lookup, but I'd be -1 on making this try to duplicate the aliasing
already done by the encodings package.
msg75255 - Author: STINNER Victor (vstinner) (Python committer) Date: 2008-10-27 14:35
Here is a patch to get the "most common charset name": use 
codecs.lookup(codeset).name.
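
For readers who want to see the effect at the Python level, here is a minimal sketch. The attached patch works in C and is not reproduced here; `canonical_name()` is a hypothetical helper that only wraps the `codecs.lookup()` call named above.

```python
import codecs


def canonical_name(codeset):
    """Return the name the codec registry reports for this spelling."""
    return codecs.lookup(codeset).name


# Different spellings of the same charset collapse to one registry name.
for spelling in ("UTF-8", "UTF8", "utf8", "latin-1", "latin1", "ISO-8859-1"):
    print(spelling, "->", canonical_name(spelling))
```

Because the registry resolves aliases, C code comparing against the reported name no longer depends on how the locale spells the charset.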
msg75256 - Author: Christian Heimes (christian.heimes) (Python committer) Date: 2008-10-27 15:03
Victor's patch fixes the issue with #3723.
msg75257 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2008-10-27 15:05
+1 on adding Victor's patch.
msg75276 - Author: Christian Heimes (christian.heimes) (Python committer) Date: 2008-10-28 11:15
Me, too! The solution is elegant and works well.

Barry still has to accept the patch, though.
msg75384 - Author: Christian Heimes (christian.heimes) (Python committer) Date: 2008-10-30 21:40
Fixed in r67055
msg75387 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2008-10-30 22:14
> The solution is elegant and works well.

I can't agree with that evaluation. In cases where Python would fail
without this patch (i.e. because the file system encoding cannot be
found during startup), this solution doesn't work well in general - it
only works if the file system encoding happens to be UTF-8. If the file
system encoding is not in the list of "builtin" codec names, startup
would still fail.

r67057 addresses this case in a somewhat more general manner, by falling
back to ASCII during startup for encoding file names. This should work
in the usual case where Python is in /usr/bin (say), but it's still
possible to make it fail: if the codecs are in /home/Питон (say) on a
system that uses koi8-r as the file system encoding, this bug would
persist despite the two patches that have been applied.
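
The limitation described here can be illustrated with a rough Python sketch. It is not the code from r67057: `encode_startup_path()` is a hypothetical helper, and an unknown codec name is used as an assumed stand-in for "the codec cannot be found during startup".

```python
import codecs


def encode_startup_path(path, fs_encoding):
    """Encode path with fs_encoding, or with ASCII if that codec is missing."""
    try:
        codecs.lookup(fs_encoding)
    except LookupError:
        # Assumed stand-in for a codec that is not importable yet at startup.
        fs_encoding = "ascii"
    return path.encode(fs_encoding)


# An ASCII-only path survives the fallback.
print(encode_startup_path("/usr/bin", "codec-not-loaded-yet"))

# A non-ASCII path cannot be encoded once the fallback kicks in.
try:
    encode_startup_path("/home/Питон", "codec-not-loaded-yet")
except UnicodeEncodeError as exc:
    print("still fails:", exc.reason)
```

The fallback helps only while every path needed to locate the codecs is pure ASCII.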
msg75392 - Author: STINNER Victor (vstinner) (Python committer) Date: 2008-10-30 22:38
On Thursday 30 October 2008 23:14:21, Martin v. Löwis wrote:
> I can't agree with that evaluation. In cases where Python would fail
> without this patch (i.e. because the file system encoding cannot be
> found during startup),

My patch doesn't change how Python gets the file system encoding: it
just gets the "Python charset name" (e.g. "utf-8" instead of "UTF8",
or "iso8859-1" instead of "latin-1"). The goal was to enable the
optimizations, especially with utf-8. It's not related to #3723.
History
Date                 User              Action  Args
2008-10-30 22:38:38  vstinner          set     messages: + msg75392
2008-10-30 22:14:20  loewis            set     nosy: + loewis; messages: + msg75387
2008-10-30 21:40:25  christian.heimes  set     status: open -> closed; resolution: accepted -> fixed; messages: + msg75384
2008-10-28 11:15:49  christian.heimes  set     resolution: accepted; messages: + msg75276
2008-10-27 15:05:33  lemburg           set     messages: + msg75257
2008-10-27 15:03:07  christian.heimes  set     messages: + msg75256
2008-10-27 14:35:16  vstinner          set     files: + get_codeset.patch; messages: + msg75255
2008-10-27 14:30:36  lemburg           set     nosy: + lemburg; messages: + msg75254
2008-10-27 14:14:39  vstinner          set     nosy: + vstinner; messages: + msg75253
2008-10-27 13:29:28  christian.heimes  link    issue3723 dependencies
2008-10-27 13:29:14  christian.heimes  create