Message114191
Python 3 has a very important variable: the filesystem encoding, sys.getfilesystemencoding(). It is used to encode and decode filenames when accessing the filesystem, to encode program arguments in subprocess, etc.
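As a small illustration of what the filesystem encoding does, the sketch below prints it and round-trips a filename through it. It relies on os.fsencode() and os.fsdecode(), which are available in Python 3.2 and later; the filename "example.txt" is only a placeholder.

```python
import os
import sys

# Name of the encoding Python uses for filenames on this system.
fs_encoding = sys.getfilesystemencoding()
print(fs_encoding)

# os.fsencode() and os.fsdecode() convert with that encoding; the
# round trip str -> bytes -> str gives back the original name.
name = "example.txt"  # placeholder filename
raw = os.fsencode(name)
round_trip = os.fsdecode(raw)
print(raw, round_trip)
```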
The encoding is hardcoded to "mbcs" on Windows and "utf-8" on Mac OS X. On other OSes, Python gets the encoding from the locale. The problem is that the code getting the locale encoding loads Python modules (e.g. locale), and Python uses a default encoding until the locale encoding is known. As a result, the filenames of modules and code objects created before Python sets the locale encoding are encoded with the default encoding.
The default encoding is "utf-8". If the locale encoding is also "utf-8", there is no problem because the filenames are correctly encoded. If the locale encoding is different, those filenames stay encoded with the wrong encoding.
It gets worse when the locale encoding is unable to encode the filenames at all, e.g. the ASCII encoding.
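To make the two failure modes concrete, here is a minimal sketch using a made-up filename ("café.py") and Latin-1 as a stand-in for a non-UTF-8 locale encoding. It only demonstrates the codec behaviour, not the interpreter's startup code.

```python
# Hypothetical filename containing one non-ASCII character (e with acute).
name = "caf\u00e9.py"

# Bytes as produced with the default encoding (UTF-8).
utf8_bytes = name.encode("utf-8")

# Interpreting those bytes with a different encoding gives a wrong name:
# the two UTF-8 bytes of the accented letter become two Latin-1 characters.
wrong_name = utf8_bytes.decode("latin-1")
print(ascii(wrong_name))

# ASCII cannot represent the name at all.
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```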
--
A solution would be to avoid loading any Python module, but I don't think that it is possible. The locale encoding can be something other than ascii, latin-1, utf-8 or mbcs. The locale encoding can be an alias like 'utf8' (instead of 'utf-8'), 'iso-8859-1' (Python uses 'latin_1') or 'ANSI_x3.4_1968' (for 'ascii'), and encoding aliases are implemented in Lib/encodings/aliases.py which is... a Python module.
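The alias resolution can be observed from Python with codecs.lookup(). In the sketch below, canonical() is a hypothetical helper name; it checks that each alias mentioned above resolves to the same codec as its usual name, without assuming how the canonical name is spelled.

```python
import codecs

def canonical(name):
    """Return the codec name that Python resolves `name` to."""
    return codecs.lookup(name).name

# (alias, usual name) pairs taken from the paragraph above.
pairs = [
    ("utf8", "utf-8"),
    ("iso-8859-1", "latin_1"),
    ("ANSI_x3.4_1968", "ascii"),
]

for alias, usual in pairs:
    same = canonical(alias) == canonical(usual)
    print(alias, "->", canonical(alias), "(same codec as", usual + ":", same, ")")
```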
--
I wrote a patch to reencode filenames of all module and code objects in initfsencoding() when the locale encoding is known.
I tested my patch on my import_unicode branch (branch to fix #8611, see also #9425: issue to merge the branch to py3k). I would like one or more reviews of the patch because it is long and complex. Please check for refleaks :-)
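The re-encoding idea can be sketched in pure Python as follows. This is not the patch itself (which is C code in initfsencoding()); reencode_filename() is a hypothetical helper, and the use of the surrogateescape error handler (PEP 383) to keep undecodable bytes recoverable is an assumption of this sketch.

```python
def reencode_filename(filename, old_encoding, new_encoding):
    """Undo a decode done with old_encoding and redo it with new_encoding."""
    raw = filename.encode(old_encoding, "surrogateescape")
    return raw.decode(new_encoding, "surrogateescape")

# Hypothetical filename stored on disk as Latin-1 bytes.
raw_name = b"caf\xe9.py"

# Decoded too early with the default encoding (UTF-8): the byte 0xE9 is not
# valid UTF-8 here, so surrogateescape keeps it as a lone surrogate.
early = raw_name.decode("utf-8", "surrogateescape")

# Once the locale encoding is known, the filename can be repaired.
fixed = reencode_filename(early, "utf-8", "iso-8859-1")
print(ascii(early), ascii(fixed))
```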
--
About the patch.
I don't know how to list *all* code objects, so I created a list that stores weak references to all code objects; the list is filled by the code object constructor. The list is destroyed when initfsencoding() exits (early in Python initialization).
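The registry idea can be illustrated in pure Python. Everything here is a stand-in: FakeCode is a hypothetical class playing the role of a code object, and fix_filenames() is a hypothetical function playing the role of the pass done in initfsencoding(); the real patch works on C-level code objects.

```python
import weakref

class FakeCode:
    """Hypothetical stand-in for a code object; only the filename matters."""

    _registry = []  # weak references to every instance created so far

    def __init__(self, filename):
        self.filename = filename
        # The constructor fills the registry, as described above.
        FakeCode._registry.append(weakref.ref(self))

def fix_filenames(fix):
    """Apply fix() to each live object's filename, then drop the registry."""
    count = 0
    for ref in FakeCode._registry:
        obj = ref()  # None if the object has already been collected
        if obj is not None:
            obj.filename = fix(obj.filename)
            count += 1
    FakeCode._registry.clear()
    return count

a = FakeCode("a.py")
b = FakeCode("b.py")
fixed_count = fix_filenames(str.upper)  # str.upper stands in for re-encoding
print(fixed_count, a.filename, b.filename)
```

Weak references keep the registry from extending the lifetime of objects that would otherwise be freed before the pass runs.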
There is a FIXME: I don't know if sys.path_importer_cache keys should also be reencoded.
I tried to apply all remarks made on the first patch (posted on Rietveld for #9425). The patch now stores weak references instead of strong references to code objects in the code object list.
(r84168 creates PyModule_GetFilenameObject, function needed by this patch)
Date | User | Action | Args
2010-08-17 23:47:07 | vstinner | set | recipients: + vstinner
2010-08-17 23:47:06 | vstinner | set | messageid: <1282088826.81.0.444273804054.issue9630@psf.upfronthosting.co.za>
2010-08-17 23:47:04 | vstinner | link | issue9630 messages
2010-08-17 23:47:03 | vstinner | create |