Rewrite import machinery to work with unicode paths #53671

vstinner · 2010-07-30T00:13:33Z

BPO	9425
Nosy	@amauryfa, @pitrou, @kristjanvalur, @vstinner, @ezio-melotti, @merwok, @florentx

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2010-10-19.01:02:36.457>
created_at = <Date 2010-07-30.00:13:32.930>
labels = ['interpreter-core', 'expert-unicode']
title = 'Rewrite import machinery to work with unicode paths'
updated_at = <Date 2010-10-19.01:02:36.455>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2010-10-19.01:02:36.455>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2010-10-19.01:02:36.457>
closer = 'vstinner'
components = ['Interpreter Core', 'Unicode']
creation = <Date 2010-07-30.00:13:32.930>
creator = 'vstinner'
dependencies = []
files = []
hgrepos = []
issue_num = 9425
keywords = ['patch', 'buildbot']
message_count = 58.0
messages = ['112026', '112027', '112032', '112038', '112039', '112213', '113164', '113165', '113255', '113256', '113259', '113261', '113308', '113342', '113347', '113351', '113353', '113354', '113355', '113546', '113548', '113598', '113726', '113761', '113764', '113771', '113785', '113795', '113796', '113834', '113835', '113843', '113852', '113855', '113859', '113862', '113903', '113904', '113913', '113915', '113955', '113956', '114002', '114059', '114062', '114078', '114080', '114087', '114089', '114090', '114192', '114819', '114827', '114944', '115180', '115343', '117630', '119097']
nosy_count = 9.0
nosy_names = ['amaury.forgeotdarc', 'pitrou', 'kristjan.jonsson', 'vstinner', 'ezio.melotti', 'eric.araujo', 'Arfrever', 'flox', 'Romme']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue9425'
versions = ['Python 3.2']

vstinner · 2010-07-30T00:13:28Z

Python (2 and 3) is unable to load a module installed in a directory whose name contains characters not encodable to the locale encoding. Python also doesn't work if it's installed in a non-ASCII directory on Windows, or with a locale encoding other than UTF-8. On Windows, the locale encoding is "mbcs", which is a small charset, unable to mix different languages, whereas the file system is fully unicode compatible (it uses UTF-16). Python should work with unicode strings (wchar_t*, Py_UNICODE* or PyUnicodeObject) instead of byte strings (char* or PyBytesObject), especially while loading a Python module.
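
As a rough illustration of the problem described above, the sketch below checks whether a path can be encoded to a given codec. The helper name `is_encodable` and the sample paths are hypothetical, and cp1252 is only a stand-in for a small Windows code page (the real "mbcs" codec exists only on Windows).

```python
def is_encodable(path: str, encoding: str) -> bool:
    """Return True if every character of path can be encoded to encoding."""
    try:
        path.encode(encoding)  # strict mode: fails on any unencodable character
    except UnicodeEncodeError:
        return False
    return True


# cp1252 is a Western European code page, used here as a stand-in for "mbcs".
print(is_encodable("C:\\Python\\café", "cp1252"))       # Latin character
print(is_encodable("C:\\Python\\日本語", "cp1252"))      # Japanese characters
print(is_encodable("C:\\Python\\日本語", "utf-16-le"))   # UTF-16, as used by the file system
```

Only the second call reports an unencodable path: a byte-string API limited to the code page cannot name that directory, while a wide-character API can.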

It's not an easy task because it requires changing a lot of code, especially in Python/import.c. I have been working on this topic for some months and I now have a working patch. It's now possible to run Python from a source tree whose path contains a non-ASCII character in the C locale (ASCII encoding). Except for a minor bug in test_gdb, all tests of the test suite pass.

I posted the whole patch on Rietveld for a review:
http://codereview.appspot.com/1874048

The patch is huge because it fixes different things:

a) import machinery (import.c, getpath.c, importdl.c, ...)
b) many error handlers using filenames (compile.c, errors.c, _warnings.c, sysmodule.c, ...)
c) functions using filenames, especially Python full path: log the filename (eg. Lib/distutils/file_util.py), filename written to a program output (eg. Lib/platform.py)
d) tests (Lib/test/test_*.py)

(b), (c) and (d) can be fixed before/without (a). But (a) requires other parts to work correctly.

If it's not possible to review the patch, I can try to split it into smaller parts.

--

Related issues:

bpo-3080: Full unicode import system
bpo-4352: imp.find_module() fails with a UnicodeDecodeError when called with non-ASCII search paths
bpo-8611: Python3 doesn't support locale different than utf8 and an non-ASCII path (POSIX)
bpo-8988: import + coding = failure (3.1.2/win32)

--

See also my email sent to python-dev for more information:
http://mail.python.org/pipermail/python-dev/2010-July/101619.html

vstinner · 2010-07-30T00:18:15Z

Oh, I forgot to say that I created an svn branch including my work: import_unicode.
http://svn.python.org/view/python/branches/import_unicode/

You can try it if you prefer svn to a huge patch.

I created a branch so you can follow my work commit by commit using svn history.

--

The patch is not completely done. There are still remaining FIXMEs. Some FIXMEs are not bugs, but improvements. The most important FIXME is to restore the support of bytes paths in sys.path. I removed it temporarily, because it was easier for me.

ezio-melotti · 2010-07-30T00:42:11Z

I wrote a few minor comments on codereview.
The patch should also include more tests.

vstinner · 2010-07-30T02:23:45Z

> The patch should also include more tests.

Which kind of test? Running the test suite in a non-ASCII directory with an encoding different than utf-8 is enough. If the patch is accepted, the solution is maybe a specific buildbot.

vstinner · 2010-07-30T03:25:18Z

Another important TODO: use weak references for the code objects list.

--

I tested my patch on Windows. It fixes bpo-8988 because non-ASCII characters are now correctly decoded with mbcs and not UTF-8. But it doesn't work with characters not encodable to mbcs. It looks like there is some remaining code using byte strings. I fixed some of it in the import_unicode branch, but it's not enough.

It is not easy to investigate because Visual Studio refuses to compile the project if the project directory contains a character not encodable to mbcs. And it is unable to debug python if the project directory is renamed after the compilation. I will maybe retry with Cygwin or with the old school "printf" method.

It looks like few Windows applications support characters not encodable to mbcs (locale encoding): neither MinGW nor WinSCP supports such characters.

vstinner · 2010-07-31T21:51:56Z

After some tests on Windows, I realized that my patch is not enough to be fully unicode compliant (on Windows). Some functions are still using PyUnicode_DecodeFSDefault() or PyUnicode_EncodeFSDefault(). Until all functions are patched to use unicode strings, Python3 will not be fully unicode compliant *on Windows*. The problem is specific to Windows, because Python uses the mbcs codec, which doesn't support the surrogateescape error handler.

I think that this patch is already huge and complex, and it will be difficult to fix all issues at the same time. This patch does improve the situation: with the patch, Python is fully unicode compliant (except on Windows), and it fixes at least one issue on Windows (bpo-8988, it now uses the right encoding).

vstinner · 2010-08-07T10:49:53Z

The patch is too huge to be committed at once. I will split it again into smaller parts.

First related commit: r83778 fixes tests for not encodable filenames.

vstinner · 2010-08-07T10:57:41Z

r83779 creates run_command(), it's just a refactoring.

vstinner · 2010-08-08T12:49:34Z

_Py_wchar2char.patch: create the private function _Py_wchar2char(), and make _wstat() and _wfopen() use it. The _Py_wchar2char() function has been improved since the previous version posted to Rietveld: it now computes the exact length of the output buffer, instead of using wcslen(text)*10+1.

Alone, this patch isn't really useful, but it prepares the code for next patches.

vstinner · 2010-08-08T12:50:35Z

r83783 creates run_file() subfunction.

vstinner · 2010-08-08T13:16:11Z

pyerr_warnformat.patch: create the PyErr_WarnFormat() function, and use it in PyType_Ready() and PyUnicode_AsEncodedString(). The patch also fixes setup_context(): it now works on the unicode filename, not the encoded (bytes) filename. This fixes a bug, because len is a number of characters, not a number of bytes: the number of bytes is bigger than the number of characters if the filename contains a non-ASCII character.
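
The character/byte mismatch mentioned above is easy to reproduce at the Python level. This is only an illustration of why a length counted in characters must not be applied to the encoded form; it is not the C code of setup_context(), and the filename is made up.

```python
filename = "café.py"                  # hypothetical module filename
encoded = filename.encode("utf-8")    # byte form of the same name

n_chars = len(filename)               # length in characters
n_bytes = len(encoded)                # length in bytes ("é" takes two bytes in UTF-8)
print(n_chars, n_bytes)

# Removing the ".py" suffix using the character count works on the str...
stem = filename[:n_chars - 3]
# ...but the same count applied to the bytes cuts inside the encoded "é".
wrong_stem = encoded[:n_chars - 3]
print(stem, wrong_stem)
```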

Advantages of PyErr_WarnFormat() over PyOS_snprintf() + PyErr_WarnEx():

it avoids creating a temporary byte buffer: it uses a unicode buffer directly,
it accepts Python (unicode) formatters like %U,
it avoids the usage of a fixed-size buffer allocated on the stack (which may be too big).

Differences with Rietveld's version: rename PyErr_WarnUnicode() to warn_unicode() (it's now a static function) and document PyErr_WarnFormat().

vstinner · 2010-08-08T13:29:04Z

nullimporter_unicode.patch: patch NullImporter_init():

use GetFileAttributesW() instead of GetFileAttributesA() in the Windows version, to be fully Unicode compliant,
use the "O&" format with PyUnicode_FSConverter instead of "es" with Py_FileSystemDefaultEncoding, to also accept bytes filenames and to support str with surrogates (PEP-383).
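
A Python-level analogue of this "accept str or bytes" behaviour is os.fsencode(), which returns bytes unchanged and encodes str with the file system encoding. The sketch below is not the NullImporter_init() code; the wrapper name `to_fs_bytes` and the filenames are hypothetical.

```python
import os


def to_fs_bytes(filename):
    """Accept a str or bytes filename and return its bytes form."""
    return os.fsencode(filename)


# bytes are passed through, str is encoded: both spellings give the same result.
print(to_fs_bytes(b"spam.py") == to_fs_bytes("spam.py"))

if os.name == "posix":
    # PEP 383: on POSIX, an undecodable byte survives a decode/encode round trip.
    raw = b"caf\xe9.py"
    name = os.fsdecode(raw)
    print(to_fs_bytes(name) == raw)
```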

pitrou · 2010-08-08T20:13:07Z

It looks like you are fixing a bug in setup_context() at the same time as you introduce PyErr_WarnFormat(). Both changes should probably go in separately.

The PyErr_WarnFormat() doc needs a "versionadded" tag.

vstinner · 2010-08-08T22:19:36Z

pitrou> It looks like you are fixing a bug in setup_context()
pitrou> at the same time as you introduce PyErr_WarnFormat().
pitrou> Both changes should probably go in separately.

Right. r83860 fixes the bug, and I attached a new version of the patch (with :versionadded:).

vstinner · 2010-08-08T22:25:32Z

gutworth's comment about r83860: "Test?"

vstinner · 2010-08-08T23:29:45Z

Py_UNICODE_strrchr.patch: create the Py_UNICODE_strrchr() function. It will be used for zipimport to work on unicode paths instead of bytes paths.

Antoine noticed that the input string is const whereas the output string is not const, which is unusual. I copy/pasted the Py_UNICODE_strchr() prototype.

I suppose that a const input and a non-const output are required to be able to use the function on const strings. The GNU libc uses the same strchr() prototype in its C version of string.h. In the C++ version of the header, it defines strchr() twice: once with const input and output, once with non-const input and output. The right solution is the C++ way, but C doesn't support polymorphism.
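
For readers unfamiliar with strrchr(), here is a minimal Python sketch of its semantics on a unicode string: return the tail starting at the last occurrence of a character, or nothing if the character is absent. The function name `unicode_strrchr` and the path are hypothetical, and this is not the C implementation from the patch.

```python
from typing import Optional


def unicode_strrchr(s: str, ch: str) -> Optional[str]:
    """Return the tail of s starting at the last occurrence of ch, or None."""
    index = s.rfind(ch)
    if index < 0:
        return None
    return s[index:]


# Hypothetical use: isolate the last component of an archive path.
path = "/tmp/répertoire/archive.zip"
print(unicode_strrchr(path, "/"))   # prints "/archive.zip"
print(unicode_strrchr(path, "\\"))  # prints "None": no backslash in the path
```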

vstinner · 2010-08-08T23:57:39Z

I created a separate issue, bpo-9542, to add the new function PyUnicode_FSDecoder().

vstinner · 2010-08-09T00:39:29Z

_Py_stat.patch: create _Py_stat() function. It will be used in import.c and zipimport.c.

I added the function to import.c because, initially, I only used it there. But it's maybe not the best place for such a function. posixmodule.c doesn't fit because it is not part of the bootstrap process.

I created this function to get full unicode support on Windows (don't fall back to bytes using the evil mbcs encoding).

In import.c and zipimport.c, it is used to check if the path is a regular file or if the path is a directory. That's why _Py_stat() only fills st_mode attribute (it's just enough).
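
The two checks described above (regular file or directory, based only on st_mode) can be sketched in Python as follows. The helper names `is_directory` and `is_regular_file` are hypothetical and this is not the C code of _Py_stat(); it only shows that st_mode alone is enough for both questions.

```python
import os
import stat
import tempfile


def is_directory(path: str) -> bool:
    """Return True if path exists and is a directory (only st_mode is used)."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        return False
    return stat.S_ISDIR(mode)


def is_regular_file(path: str) -> bool:
    """Return True if path exists and is a regular file (only st_mode is used)."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        return False
    return stat.S_ISREG(mode)


with tempfile.TemporaryDirectory() as tmp:
    module = os.path.join(tmp, "spam.py")
    with open(module, "w") as f:
        f.write("pass\n")
    print(is_directory(tmp), is_regular_file(tmp))
    print(is_directory(module), is_regular_file(module))
```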

A better API would maybe be functions checking these properties directly? Maybe Py_isdir() (like os.path.isdir()) and Py_isreg()? Or if you prefer longer names: Py_is_directory() and Py_is_regular_file()? Such functions can be implemented differently, eg. using GetFileAttributesW on Windows. I say that because of a comment found in NullImporter_init():

/* see bpo-1293 and bpo-3677:
 * stat() on Windows doesn't recognise paths like
 * "e:\\shared\\" and "\\\\whiterab-c2znlh\\shared" as dirs.
 */

vstinner · 2010-08-09T01:00:29Z

r83870 creates load_builtin() subfunction in import.c to prepare and simplify the big patch.

vstinner · 2010-08-10T16:38:09Z

I committed Py_UNICODE_strrchr.patch as r83933 after removing the useless start variable.

vstinner · 2010-08-10T16:56:33Z

_PyFile_FromFdUnicode.patch: create the _PyFile_FromFdUnicode() function. It will be used in import.c to open a file using a unicode filename.

For _PyFile_FromFd(), I kept the previous behaviour: clear the exception on PyUnicode_DecodeFSDefault() error.

For fileobject.h: I used the same style as unicodeobject.h, one argument per line with its name. I prefer to write the argument name because the header can be used as a quick documentation.

As with _PyFile_FromFd(), name is optional (can be NULL) for _PyFile_FromFdUnicode().

pitrou · 2010-08-11T10:10:06Z

Actually, I'm not sure there's much point since the "name" attribute is currently read-only:

>>> f = open(1, "wb")
>>> f.name = "foo"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: attribute 'name' of '_io.BufferedWriter' objects is not writable
>>> 
>>> g = open(1, "w")
>>> g.name = "bar"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: attribute 'name' of '_io.TextIOWrapper' objects is not writable

vstinner · 2010-08-13T00:44:27Z

(About PyFile_FromFd)
pitrou> Actually, I'm not sure there's much point since the "name"
pitrou> attribute is currently read-only: (...)

Oh, it reminds me of bpo-4762. I closed this issue with the message "The last problem occurs with imp.find_module(). But imp.find_module() also returns a "filename" argument, so I don't think that the issue really matters. Let's close it ;-)".

Even if it would be possible to set f(.buffer).raw.name, the solution is maybe just to ignore the argument (don't set any name attribute). Can we change such public function?

vstinner · 2010-08-13T13:02:33Z

r83971 enables test.support.TESTFN_UNDECODEABLE on non-Windows OSes.

vstinner · 2010-08-13T13:08:08Z

I committed nullimporter_unicode.patch with a unit test as r83972.

vstinner · 2010-08-13T13:36:12Z

r83973 ignores the name argument of PyFile_FromFd() because it was already ignored (it always produced an error), and it avoids my complex _PyFile_FromFdUnicode.patch. Thanks to Antoine for noticing that name was ignored.

vstinner · 2010-08-13T15:22:42Z

Note about _Py_wchar2char(): it is possible to convert character by character (instead of working on substrings) because the input string doesn't contain surrogate pairs. _Py_char2wchar() ensures that the output string doesn't contain surrogate pairs: if a byte sequence produces a surrogate pair, the byte sequence is encoded using the surrogateescape error handler (U+DC80..U+DCFF range). I should add this note to the _Py_wchar2char() comment.
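
The round trip described in this note can be sketched at the Python level with the "surrogateescape" error handler. This is an illustration under the assumption of a UTF-8 encoding, not the C code of _Py_char2wchar()/_Py_wchar2char(); the byte string is made up.

```python
raw = b"mod\xff\xe9.py"   # hypothetical filename, not valid UTF-8

# Decoding: each undecodable byte becomes one lone surrogate (0xFF -> U+DCFF).
text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))

# Encoding can then be done one character at a time, because every escaped
# byte is represented by exactly one character.
encoded = b"".join(ch.encode("utf-8", "surrogateescape") for ch in text)
print(encoded == raw)
```

Converting character by character gives back the original bytes, which is the property the note relies on.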

vstinner · 2010-08-13T16:39:19Z

r83981 closes bpo-9560: avoid the filename in _syscmd_file() to fix a bug with non encodable filenames in platform.architecture().

merwok · 2010-08-13T21:29:53Z

I know this is not introduced by your patch, just moved, but couldn’t
the typo in UNDECODEABLE be fixed? (extraneous e)

vstinner · 2010-08-13T22:24:11Z

> I know this is not introduced by your patch, just moved, but couldn’t
> the typo in UNDECODEABLE be fixed? (extraneous e)

I wasn't sure that it was a typo, so I kept it unchanged. It's now fixed by
r83987.

vstinner · 2010-08-13T23:30:57Z

r83989 creates _Py_wchar2char() function (_Py_wchar2char-2.patch).

vstinner · 2010-08-14T00:01:19Z

r83990 closes bpo-9542 by creating the PyUnicode_FSDecoder() PyArg_ParseTuple parser.

vstinner · 2010-08-14T01:04:42Z

r83976 adds PyErr_WarnFormat() (pyerr_warnformat-2.patch).

vstinner · 2010-08-14T01:22:15Z

I created bpo-9599: Add PySys_FormatStdout and PySys_FormatStderr functions.

vstinner · 2010-08-14T14:52:07Z

r84012 creates _Py_stat(). It is a little bit different than the attached patch (_Py_stat.patch): it doesn't clear Python exception on unicode conversion error.

vstinner · 2010-08-14T14:55:08Z

r84012 patches zipimporter_init() to use the new PyUnicode_FSDecoder() and use Py_UNICODE* (unicode) strings instead of char* (byte) strings.

vstinner · 2010-08-14T17:07:12Z

r84030 creates _Py_fopen() for PyUnicodeObject path.

vstinner · 2010-08-14T17:13:40Z

zipimport_read_directory.patch: patch for the read_directory() function of the zipimport module to support unencodable filenames. This patch requires bpo-9599 (PySys_FormatStderr). The patch changes the encoding of the name: decode the name byte string using the file system encoding (and PEP-383 on POSIX) instead of utf-8 in strict mode.

florentx · 2010-08-15T13:01:39Z

r83972 breaks OS X buildbots: support.TESTFN_UNENCODABLE is not defined if sys.platform == 'darwin'.

  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_imp.py", line 309, in <module>
    class NullImporterTests(unittest.TestCase):
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_imp.py", line 310, in NullImporterTests
    @unittest.skipIf(support.TESTFN_UNENCODABLE is None,
AttributeError: 'module' object has no attribute 'TESTFN_UNENCODABLE'

florentx · 2010-08-15T13:07:59Z

It breaks test_unicode_file on OS X, too:

  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_unicode_file.py", line 8, in <module>
    from test.support import (run_unittest, rmtree,
ImportError: cannot import name TESTFN_UNENCODABLE

vstinner · 2010-08-15T19:30:05Z

I tried to fix Mac OS X (TESTFN_UNENCODABLE) with r84035, but I don't have access to Mac OS X to test and my patch was not correct. It should now be ok with r84080.

vstinner · 2010-08-16T17:55:11Z

zipimport_read_directory.patch committed as r84095.

vstinner · 2010-08-16T18:43:11Z

Py_UNICODE_strncmp.patch: create Py_UNICODE_strncmp() function.

vstinner · 2010-08-16T21:39:12Z

Py_UNICODE_strncmp.patch was wrong for n=0. New version based on libiberty/strncmp.c source code.
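
To make the n=0 case concrete, here is a minimal Python sketch of strncmp() semantics on unicode strings: compare at most n characters, and report equality when n is 0. The function name `unicode_strncmp` is hypothetical; this is neither the committed C code nor the libiberty source.

```python
def unicode_strncmp(s1: str, s2: str, n: int) -> int:
    """Compare at most n characters; return -1, 0 or 1 like C strncmp()."""
    for i in range(n):
        # 0 plays the role of the C string terminator once a string is exhausted.
        c1 = ord(s1[i]) if i < len(s1) else 0
        c2 = ord(s2[i]) if i < len(s2) else 0
        if c1 != c2:
            return -1 if c1 < c2 else 1
        if c1 == 0:
            break  # both strings ended at the same position
    return 0


print(unicode_strncmp("abc", "abd", 2))  # first two characters are equal
print(unicode_strncmp("abc", "abd", 3))  # "c" sorts before "d"
print(unicode_strncmp("x", "y", 0))      # nothing is compared when n is 0
```

With n=0 the loop body never runs, so the strings compare equal whatever their content.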

vstinner · 2010-08-16T22:04:17Z

Py_UNICODE_strncmp-2.patch committed as r84111.

vstinner · 2010-08-16T23:49:18Z

r84120: the get_data() function of zipimport uses a unicode path.

vstinner · 2010-08-17T00:05:32Z

r84121: the repr() method of the zipimporter object uses unicode.

vstinner · 2010-08-17T00:43:02Z

r84122 saves/restores the exception around "filename = _PyUnicode_AsString(co->co_filename);" because it raises a unicode error on an unencodable filename.

vstinner · 2010-08-17T23:48:59Z

r84168 creates PyModule_GetFilenameObject().

I created a separate issue for the patch reencoding all filenames when setting the filesystem encoding: bpo-9630.

vstinner · 2010-08-24T20:36:11Z
