Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite import machinery to work with unicode paths #53671

Closed
vstinner opened this issue Jul 30, 2010 · 58 comments
Closed

Rewrite import machinery to work with unicode paths #53671

vstinner opened this issue Jul 30, 2010 · 58 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode

Comments

@vstinner
Copy link
Member

BPO 9425
Nosy @amauryfa, @pitrou, @kristjanvalur, @vstinner, @ezio-melotti, @merwok, @florentx

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2010-10-19.01:02:36.457>
created_at = <Date 2010-07-30.00:13:32.930>
labels = ['interpreter-core', 'expert-unicode']
title = 'Rewrite import machinery to work with unicode paths'
updated_at = <Date 2010-10-19.01:02:36.455>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2010-10-19.01:02:36.455>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2010-10-19.01:02:36.457>
closer = 'vstinner'
components = ['Interpreter Core', 'Unicode']
creation = <Date 2010-07-30.00:13:32.930>
creator = 'vstinner'
dependencies = []
files = []
hgrepos = []
issue_num = 9425
keywords = ['patch', 'buildbot']
message_count = 58.0
messages = ['112026', '112027', '112032', '112038', '112039', '112213', '113164', '113165', '113255', '113256', '113259', '113261', '113308', '113342', '113347', '113351', '113353', '113354', '113355', '113546', '113548', '113598', '113726', '113761', '113764', '113771', '113785', '113795', '113796', '113834', '113835', '113843', '113852', '113855', '113859', '113862', '113903', '113904', '113913', '113915', '113955', '113956', '114002', '114059', '114062', '114078', '114080', '114087', '114089', '114090', '114192', '114819', '114827', '114944', '115180', '115343', '117630', '119097']
nosy_count = 9.0
nosy_names = ['amaury.forgeotdarc', 'pitrou', 'kristjan.jonsson', 'vstinner', 'ezio.melotti', 'eric.araujo', 'Arfrever', 'flox', 'Romme']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue9425'
versions = ['Python 3.2']

@vstinner
Copy link
Member Author

Python (2 and 3) is unable to load a module installed in a directory containing characters not encodable to the locale encoding. And Python doesn't work if it's installed in non-ASCII directory on Windows or with a locale encoding different than UTF-8. On Windows, the locale encoding is "mbcs", which is a small charset, unable to mix different languages, whereas the file system is fully unicode compatible (it uses UTF-16). Python should work with unicode strings (wchar_t*, Py_UNICODE* or PyUnicodeObject) instead of byte strings (char* or PyBytesObject), especially while loading a Python module.

It's not an easy task because it requires to change a lot of code, especially in Python/import.c. I am working on this topic since some months and I have now a working patch. It's now possible to run Python from the source tree containing a non-ASCII character in C locale (ASCII encoding). Except just a minor bug in test_gdb, all tests of the test suite pass.

I posted the whole patch on Rietveld for a review:
http://codereview.appspot.com/1874048

The patch is huge because it fixes different things:

a) import machinery (import.c, getpath.c, importdl.c, ...)
b) many error handlers using filenames (compile.c, errors.c, _warnings.c, sysmodule.c, ...)
c) functions using filenames, especially Python full path: log the filename (eg. Lib/distutils/file_util.py), filename written to a program output (eg. Lib/platform.py)
d) tests (Lib/test/test_*.py)

(b), (c) and (d) can be fixed before/without (a). But (a) requires other parts to work correctly.

If it's not possible to review the patch, I can try to split it in smaller parts.

--

Related issues:

bpo-3080: Full unicode import system
bpo-4352: imp.find_module() fails with a UnicodeDecodeError
when called with non-ASCII search paths
bpo-8611: Python3 doesn't support locale different than utf8
and an non-ASCII path (POSIX)
bpo-8988: import + coding = failure (3.1.2/win32)

--

See also my email sent to python-dev for more information:
http://mail.python.org/pipermail/python-dev/2010-July/101619.html

@vstinner vstinner added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode labels Jul 30, 2010
@vstinner
Copy link
Member Author

Oh, I forgot to say that I created an svn branch including my work: import_unicode.
http://svn.python.org/view/python/branches/import_unicode/

You can try it if you prefer svn to an huge patch.

I created a branch so you can follow my work commit by commit using svn history.

--

The patch is not completly done. There are still remaining FIXMEs. Some FIXME are not bugs, but improvments. The most important FIXME is to restore the support of bytes path in sys.path. I removed it temporary, because it was easier for me.

@ezio-melotti
Copy link
Member

I wrote a few minor comments on codereview.
The patch should also include more tests.

@vstinner
Copy link
Member Author

The patch should also include more tests.

Which kind of test? Run the test suite in a non-ASCII directory with encoding different than utf-8 is enough. If the patch is accepted, the solution is maybe a specific buildbot.

@vstinner
Copy link
Member Author

Another important TODO: use weak references for the code objects list.

--

I tested my patch on Windows. I fixes bpo-8988 because non-ASCII characters are now correctly decoded with mbcs and not UTF-8. But it doesn't work with characters not encodable to mbcs. It looks like there are some remaining code using byte string. I fixed some of them in import_unicode branch, but it's not enough.

It is not easy to investigate because Visual Studio refuse to compile the project if the project directory contains a character not encodable to mbcs. And it is unable to debug python if the project directory is renamed after the compilation. I will maybe retry with Cygwin or with the old school "printf" method.

It looks like few Windows applications support characters not encodable to mbcs (locale encoding): MinGW and WinSCP do neither support such characters.

@vstinner
Copy link
Member Author

After some tests on Windows, I realized that my patch is not enough to be fully unicode compliant (on Windows). Some functions are still using PyUnicode_DecodeFSDefault() or PyUnicode_EncodeFSDefault(). Until all functions are patched to use unicode strings, Python3 will not be fully unicode compliant *on Windows*. The problem is specific to Windows, because Python uses mbcs codec which doesn't support surrogateescape error handler.

I think that this patch is already huge and complex, and it will be difficult to fix all issues at the same time. This patch does improve the situation: with the patch, Python is fully unicode compliant (except on Windows), and it fixes at least one issue on Windows (bpo-8988, it now uses the right encoding).

@vstinner
Copy link
Member Author

vstinner commented Aug 7, 2010

The patch is too huge to be commited at once. I will split it again into smaller parts.

First related commit: r83778 fixes tests for not encodable filenames.

@vstinner
Copy link
Member Author

vstinner commented Aug 7, 2010

r83779 creates run_command(), it's just a refactorization.

@vstinner
Copy link
Member Author

vstinner commented Aug 8, 2010

_Py_wchar2char.patch: create _Py_wchar2char() private function, and _wstat() and _wfopen() use it. _Py_wchar2char() function has been improved since the previous version posted to Rietveld: it now computes the exact length of the output buffer, instead of using wcslen(text)*10+1.

Alone, this patch isn't really useful, but it prepares the code for next patches.

@vstinner
Copy link
Member Author

vstinner commented Aug 8, 2010

r83783 creates run_file() subfunction.

@vstinner
Copy link
Member Author

vstinner commented Aug 8, 2010

pyerr_warnformat.patch: create PyErr_WarnFormat() function, and use it in PyType_Ready() and PyUnicode_AsEncodedString(). The patch fixes also setup_context(): work on the unicode filename, not the encoded (bytes) filename. It does fix a bug because len is a number of characters, not a number of bytes: the number of bytes is bigger than the number of characters if the filename contains a non-ASCII character.

Advantages of PyErr_WarnFormat() over PyOS_snprintf() + PyErr_WarnEx():

  • it avoids the create a temporary byte buffer: use directly an unicode buffer,
  • it accepts Python (unicode) formatters like %U,
  • it avoids the usage of a fixed size buffer allocated on the stack (which may be too big).

Differences with Rietveld's version: rename PyErr_WarnUnicode() to warn_unicode() (it's now a static function) and document PyErr_WarnFormat().

@vstinner
Copy link
Member Author

vstinner commented Aug 8, 2010

nullimporter_unicode.patch: patch NullImporter_init():

  • use GetFileAttributesW() instead of GetFileAttributesA() for the Windows version to be fully Unicode compliant
  • use "O&" format with PyUnicode_FSConverter instead of "es" with Py_FileSystemDefaultEncoding to accept also bytes filenames and support str with surrogates (PEP-383)

@pitrou
Copy link
Member

pitrou commented Aug 8, 2010

It looks like you are a fixing a bug in setup_context() at the same time as you introduce PyErr_WarnFormat(). Both changes should probably go in separately.

The PyErr_WarnFormat() doc needs a "versionadded" tag.

@vstinner
Copy link
Member Author

vstinner commented Aug 8, 2010

pitrou> It looks like you are a fixing a bug in setup_context()
pitrou> at the same time as you introduce PyErr_WarnFormat().
pitrou> Both changes should probably go in separately.

Right. r83860 fixes the bug, and I attached a new version of the patch (with :versionadded:).

@vstinner
Copy link
Member Author

vstinner commented Aug 8, 2010

gutworth's comment about r83860: "Test?"

@vstinner
Copy link
Member Author

vstinner commented Aug 8, 2010

Py_UNICODE_strrchr.patch: Create Py_UNICODE_strrchr() function. It will be used for zipimport to work on unicode paths instead of bytes paths.

Antoine noticed that the input string is const whereas the output string is not const, which is unusual. I copy/pasted Py_UNICODE_strchr() prototype.

I suppose that const input and non const input is required to be able to use the function on const strings. The GNU libc uses the same strchr() prototype in its C version of string.h. In the C++ version of the header, it defines the strchr() twice: once with const input and output, once with non const input and ouput. The right solution is the C++ way, but C doesn't support polymophism.

@vstinner
Copy link
Member Author

vstinner commented Aug 8, 2010

I created a separated issue, bpo-9542, to add the new function PyUnicode_FSDecoder().

@vstinner
Copy link
Member Author

vstinner commented Aug 9, 2010

_Py_stat.patch: create _Py_stat() function. It will be used in import.c and zipimport.c.

I added the function to import.c because, initially, I only used it there. But it's maybe not the best place for such function. posixmodule.c doesn't fit because it is not part of the bootstrap process.

I created this function to get full unicode support on Windows (don't fallback to bytes using the evil mbcs encoding).

In import.c and zipimport.c, it is used to check if the path is a regular file or if the path is a directory. That's why _Py_stat() only fills st_mode attribute (it's just enough).

A better API would be maybe functions checking directly these properties? Maybe Py_isdir() (as os.path.isdir()) and Py_isreg()? Or if you prefer longer names: Py_is_directory() ad Py_is_regular_file()? Such functions can be implemented differently, eg. use GetFileAttributesW on Windows. I say that because of a comment found in NullImporter_init():

/* see bpo-1293 and bpo-3677:
 * stat() on Windows doesn't recognise paths like
 * "e:\\shared\\" and "\\\\whiterab-c2znlh\\shared" as dirs.
 */

@vstinner
Copy link
Member Author

vstinner commented Aug 9, 2010

r83870 creates load_builtin() subfunction in import.c to prepare and simplify the big patch.

@vstinner
Copy link
Member Author

I commited Py_UNICODE_strrchr.patch as r83933 after removing the useless start variable.

@vstinner
Copy link
Member Author

_PyFile_FromFdUnicode.patch: create _PyFile_FromFdUnicode() function. It will be used in import.c to open a file using an unicode filename.

For _PyFile_FromFd(), I kept the previous behaviour: clear the exception on PyUnicode_DecodeFSDefault() error.

For fileobject.h: I used the same style than unicodeobject.h, one argument per line with their name. I prefer to write the argument name because the header can be used as a quick documentation.

As _PyFile_FromFd(), name is optional (can be NULL) for _PyFile_FromFdUnicode().

@pitrou
Copy link
Member

pitrou commented Aug 11, 2010

Actually, I'm not sure there's much point since the "name" attribute is currently read-only:

>>> f = open(1, "wb")
>>> f.name = "foo"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: attribute 'name' of '_io.BufferedWriter' objects is not writable
>>> 
>>> g = open(1, "w")
>>> g.name = "bar"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: attribute 'name' of '_io.TextIOWrapper' objects is not writable

@vstinner
Copy link
Member Author

(About PyFile_FromFd)
pitrou> Actually, I'm not sure there's much point since the "name"
pitrou> attribute is currently read-only: (...)

Oh, it remembers me bpo-4762. I closed this issue with the message "The last problem occurs with imp.find_module(). But imp.find_module() also returns a "filename" argument, so I don't think that the issue really matters. Let's close it ;-)".

Even if it would be possible to set f(.buffer).raw.name, the solution is maybe just to ignore the argument (don't set any name attribute). Can we change such public function?

@vstinner
Copy link
Member Author

r83971 enables test.support.TESTFN_UNDECODEABLE on non-Windows OSes.

@vstinner
Copy link
Member Author

I commited nullimporter_unicode.patch with an unit test as r83972.

@vstinner
Copy link
Member Author

r83973 ignores the name argument of PyFile_FromFd() because it was already ignored (it did always produce an error) and it avoids my complex _PyFile_FromFdUnicode.patch. Thanks Antoine to having notice that name was ignored.

@vstinner
Copy link
Member Author

Note about _Py_wchar2char(): it is possible to convert character by character (instead of working on substrings) because the input string doesn't contain surrogate pairs. _Py_char2wchar() ensures the the output string doens't contain surrogate pairs: if a byte sequence produces a surrogate pairs, the byte sequence is encoded using the surrogateescape error handler (U+DC00..U+DCFF range). I should add this note in _Py_wchar2char() comment.

@vstinner
Copy link
Member Author

r83981 closes bpo-9560: avoid the filename in _syscmd_file() to fix a bug with non encodable filenames in platform.architecture().

@merwok
Copy link
Member

merwok commented Aug 13, 2010

I know this is not introduced by your patch, just moved, but couldn’t
the typo in UNDECODEABLE be fixed? (extraneous e)

@vstinner
Copy link
Member Author

I know this is not introduced by your patch, just moved, but couldn’t
the typo in UNDECODEABLE be fixed? (extraneous e)

I wasn't sure that it was a typo, so I kept it unchanged. It's now fixed by
r83987.

@vstinner
Copy link
Member Author

r83989 creates _Py_wchar2char() function (_Py_wchar2char-2.patch).

@vstinner
Copy link
Member Author

r83990 closes bpo-9542 by creating the PyUnicode_FSDecoder() PyArg_ParseTuple parser.

@vstinner
Copy link
Member Author

r83976 adds PyErr_WarnFormat() (pyerr_warnformat-2.patch).

@vstinner
Copy link
Member Author

I created bpo-9599: Add PySys_FormatStdout and PySys_FormatStderr functions.

@vstinner
Copy link
Member Author

r84012 creates _Py_stat(). It is a little bit different than the attached patch (_Py_stat.patch): it doesn't clear Python exception on unicode conversion error.

@vstinner
Copy link
Member Author

r84012 patchs zipimporter_init() to use the new PyUnicode_FSDecoder() and use Py_UNICODE* (unicode) strings instead of char* (byte) strings.

@vstinner
Copy link
Member Author

r84030 creates _Py_fopen() for PyUnicodeObject path.

@vstinner
Copy link
Member Author

zipimport_read_directory.patch: patch for read_directory() function of the zipimport module to support unencodable filenames. This patch requires bpo-9599 (PySys_FormatStderr). The patch changes the encoding of the name: decode name byte string using the file system encoding (and the PEP-383 on POSIX) instead of the utf-8 in strict mode.

@florentx
Copy link
Mannequin

florentx mannequin commented Aug 15, 2010

r83972 breaks OS X buildbots: support.TESTFN_UNENCODABLE is not defined if sys.platform == 'darwin'.

File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_imp.py", line 309, in <module>
class NullImporterTests(unittest.TestCase):
File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_imp.py", line 310, in NullImporterTests
@unittest.skipIf(support.TESTFN_UNENCODABLE is None,
AttributeError: 'module' object has no attribute 'TESTFN_UNENCODABLE'

@florentx
Copy link
Mannequin

florentx mannequin commented Aug 15, 2010

It breaks test_unicode_file on OS X, too:

File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_unicode_file.py", line 8, in <module>
from test.support import (run_unittest, rmtree,
ImportError: cannot import name TESTFN_UNENCODABLE

@vstinner
Copy link
Member Author

I tried to fix Mac OS X (TESTFN_UNENCODABLE) with r84035, but I don't have access to Mac OS X to test and my patch was not correct. It should now be ok with r84080.

@vstinner
Copy link
Member Author

zipimport_read_directory.patch commited as r84095.

@vstinner
Copy link
Member Author

Py_UNICODE_strncmp.patch: create Py_UNICODE_strncmp() function.

@vstinner
Copy link
Member Author

Py_UNICODE_strncmp.patch was wrong for n=0. New version based on libiberty/strncmp.c source code.

@vstinner
Copy link
Member Author

Py_UNICODE_strncmp-2.patch commited as r84111.

@vstinner
Copy link
Member Author

r84120: get_data() function of zipimport uses an unicode path.

@vstinner
Copy link
Member Author

r84121: repr() method zipimporter object uses unicode.

@vstinner
Copy link
Member Author

r84122 saves/restores the exception around "filename = _PyUnicode_AsString(co->co_filename);" because it raises an unicode error on unencodable filename.

@vstinner
Copy link
Member Author

r84168 creates PyModule_GetFilenameObject().

I created a separated issue for the patch reencoding all filenames when setting the filesystem encoding: bpo-9630.

@vstinner
Copy link
Member Author

See also bpo-1552880.

@kristjanvalur
Copy link
Mannequin

kristjanvalur mannequin commented Aug 24, 2010

Yes. in bpo-1552880 I tried to make as minimal a change as possible. This particular patch is still in use in EVE Online, which is installed in various strange and exotic paths in the orient..

The trick I employed there was to encode everything to utf-8 at the earliest oppertunity (current working directory, any unicode members in sys.path, etc.) and let the import.c machinery crunch that utf-8 code. This works because path separators and other such stuff doesn't change under the utf-8 encoding. As a final step, the utf8 encoded working string is converted back to unicode and native unicode API calls (on windows) are used to stat() and open() files.

A similar trick could be used on unix by converting from utf-8 to whatever native encoding the stat() and open() calls expect.

My patch never got accepted because I didn't have the time to put in the extra effort to make it cross platform.

@vstinner
Copy link
Member Author

r84012 patchs zipimporter_init() to use the new PyUnicode_FSDecoder()
and use Py_UNICODE* (unicode) strings instead of char* (byte) strings.

oops, it's r84013 (not r84012)

@vstinner
Copy link
Member Author

Py_UNICODE_strcat.patch: create Py_UNICODE_strcat() function.

Py_UNICODE_strdup.patch: create Py_UNICODE_strdup() function.

@vstinner
Copy link
Member Author

vstinner commented Sep 1, 2010

r84429 creates Py_UNICODE_strcat() (change with the patch: return the right value).

r84430 creates PyUnicode_strdup() (change with the patch: rename the function from Py_UNICODE_strdup() to PyUnicode_strdup() and mangle the function name).

@vstinner
Copy link
Member Author

r85115 closes bpo-9630: an important patch for bpo-9425, redecode all filenames when setting the filesystem encoding.

Next tasks (maybe not in this order):

  • merge getpath.c
  • redecode argv[0] used by PySys_SetArgvEx() to feed sys.path (encode argv[0] with the locale encoding and then decode it using the filesystem encoding): it is required if PYTHONFSENCODING environment variable is used
  • merge import.c (in small patchs if it's possible)
  • and other things that I forgot

@vstinner
Copy link
Member Author

Starting at r85691, the full test suite of Python 3.2 pass with ASCII, ISO-8859-1 and UTF-8 locale encodings in a non-ascii directory. The work on this issue is done.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode
Projects
None yet
Development

No branches or pull requests

4 participants