Full unicode import system #47330
Comments
This is the most difficult part of bpo-1342: it implies rewriting all functions in import.c and replacing all char* |
I suspect importlib may help with this. |
Victor is working on this. |
I posted a patch: bpo-9425. |
With bpo-8611 and bpo-9425, I patched a lot of functions and modules, including the NullImporter and zipimport, but not the core of the import machinery. In my import_unicode SVN branch, I patched the import machinery to manipulate unicode strings instead of byte strings. But the patch is huge and the import machinery is fragile. Since Python 3.2 now works in a non-ASCII directory with an ASCII locale (filesystem) encoding, I don't plan to merge the patch into py3k. The patch is still useful on Windows, because Python uses the mbcs encoding to encode/decode filenames, and this encoding usually covers only a very small subset of Unicode (e.g. cp1252 has 256 codes whereas Unicode 6.0 has 109,449 characters). |
With bpo-1342 fixed, it seems that this issue is no longer critical (Haypo describes his complicated patch as "useful on Windows", but not critical). So I'm downgrading it to 'high'. Perhaps it is even 'normal'. It also seems as though it is currently languishing unless someone wants to pick it up. |
Use case on Windows: your Japanese friend gives you a USB key (e.g. created on Windows with code page 932) with his Python project, and you cannot run it on your English-speaking Windows (e.g. code page 1252), because it loads Python modules with Japanese characters in their paths. It works if all paths are encodable to your ANSI code page. It doesn't work if at least one character of one path is not encodable to your ANSI code page. I don't know if this use case is common or not. Note: the FAT file system of the USB key stores filenames as UTF-16 (and not in the user code page). |
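The encodability condition in the use case above can be checked directly from Python. This is only a sketch: the path is a made-up example, and the code page names are the codecs Python ships under the names cp932 and cp1252, not a query of the real ANSI code page of the machine.

```python
def encodable(path: str, codepage: str) -> bool:
    """Return True if every character of path can be encoded to codepage."""
    try:
        path.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical module path containing Japanese characters.
japanese_path = "E:\\プロジェクト\\モジュール.py"
ascii_path = "E:\\project\\module.py"

print(encodable(japanese_path, "cp932"))   # Japanese code page
print(encodable(japanese_path, "cp1252"))  # Western European code page
print(encodable(ascii_path, "cp1252"))
```

A path that fails this check for the local code page is exactly the kind of path that a bytes-based import machinery cannot represent.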
Issue bpo-10785 prepares the work for this issue: store input filename as a unicode string, instead of a byte string, in the parser. |
If I edit a file with IDLE, save it, and successfully run it (perhaps to test it), then when I edit a second file that imports the first, I expect the import to work. It does not always (see bpo-10828). Import is part of the core definition of the language. Unicode identifiers are supposedly part of Python3. Given the existence of <identifier>.py in the current directory, 'import identifier' should work. If it does not, the 3.1 message '<identifier> not found' is more truthful than the current 'no module named <identifier>', when there is one. The doc says "identifier ::= (identifier ".")* identifier". As long as that is not true, some indication of the restriction that most people can understand would be nice. (And I suspect that a majority of Windows users, at least in the US, have no idea of what an 'ANSI code page' is.) |
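The expectation described above (an <identifier>.py file in a directory on the path can be imported by that identifier) can be exercised with a small script. This is a sketch under assumptions: the module name, the helper import_from_dir and the temporary directory are invented for illustration, and the file system is assumed to accept the non-ASCII file name.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_from_dir(directory: str, module_name: str):
    """Import module_name with directory temporarily placed first on sys.path."""
    sys.path.insert(0, directory)
    try:
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
    finally:
        sys.path.remove(directory)


# Hypothetical non-ASCII module name (U+00E9 is a valid identifier character).
name = "caf\u00e9"

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, name + ".py").write_text("VALUE = 42\n", encoding="utf-8")
    module = import_from_dir(tmp, name)

print(name.isidentifier(), module.VALUE)
```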
Here is a work-in-progress patch: bpo-3080-3.patch. The patch is HUGE and written for Python 3.3.
$ diffstat issue3080-3.patch
Doc/c-api/module.rst | 24
Include/import.h | 73 +
Include/moduleobject.h | 2
Include/pycapsule.h | 4
Modules/zipimport.c | 272 +++
Objects/moduleobject.c | 52 -
As expected, most of the work is done in import.c.
Decode the module name earlier and encode it later. Try to manipulate PyUnicodeObject objects instead of char* buffers (so we directly have the string length).
Split the huge and very complex find_module() function into 3 functions (find_module, find_module_filename and find_module2) and document them.
Drop OS/2 support in find_module() (it can be kept, but it was easier for me to drop it and the OS/2 maintainer wrote that Python 3 is far from being compatible with OS/2).
The patch creates some functions: PyModule_GetNameObject(), PyImport_ExecCodeModuleUnicode(), PyImport_AddModuleUnicode(), PyImport_ImportFrozenModuleUnicode(), PyModule_NewUnicode(), ...
Use "U" format to parse a module name, and "%R" to format a module name (to escape surrogate characters and add quotes, instead of "... '%.200s' ...").
PyWin_FindRegisteredModule() is now private.
Remove the fqname argument from _PyImport_GetDynLoadFunc(), it wasn't used.
Replace open_exclusive() by fopen(name, "wb") on Windows: is it correct?
TODO:
The patch contains a tiny script, bpo-3080.py, to test the patch using an ISO-8859-1 locale. I will open a thread on the mailing list (python-dev) to decide if this patch is needed or not. If we agree that this issue should be fixed, I will split the patch into smaller parts and start a review process. |
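On the open_exclusive() question raised in the patch description: a plain "wb" open truncates an existing file, whereas an exclusive open fails if the file already exists. The Python sketch below illustrates that difference only; it is not the C function from import.c, and the helper name and file name are chosen for illustration.

```python
import os
import tempfile


def open_exclusive(filename: str, mode: int = 0o666):
    """Create filename for binary writing; fail if it already exists."""
    flags = os.O_CREAT | os.O_EXCL | os.O_WRONLY
    flags |= getattr(os, "O_BINARY", 0)  # O_BINARY only exists on Windows
    fd = os.open(filename, flags, mode)
    return os.fdopen(fd, "wb")


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "module.pyc")  # hypothetical file name
    with open_exclusive(target) as f:
        f.write(b"placeholder")
    try:
        open_exclusive(target)
    except FileExistsError:
        second_attempt_failed = True
    else:
        second_attempt_failed = False

print(second_attempt_failed)
```

os.open() accepts a str file name, so no manual encoding step is needed on the Python side.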
Victor, could you please create a Rietveld review for this? The auto-review creator can't cope with the Git diffs. |
Yes, but not yet. I first have to clean up the patch. |
OK - I'll wait until that is ready before digging into this. |
See also bpo-8754: repr() is better than str() for other reasons, e.g. to see a space at the end of a module name (__import__('space ')) thanks to the quotes. |
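The point about repr() can be seen with Python's own %s and %r formatting, used here as a stand-in for the C-level "%R" format. The message text and names are illustrative, not the interpreter's actual error strings.

```python
# Module name with a trailing space, as in the example above.
name = "space "

message_str = "No module named %s" % name
message_repr = "No module named %r" % name
print(message_str)   # the trailing space is invisible
print(message_repr)  # the quotes make the trailing space visible

# A name containing a lone surrogate (as produced by surrogateescape decoding)
# is escaped by repr(), so it can be printed safely.
undecodable = "caf\udce9"
print("%r" % undecodable)
```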
Version 4 of the patch. |
Same patch (version 4) generated by svn. |
You can review the patch with Rietveld: |
Oops, there is a silly typo in imp_init_builtin() that makes test_importlib crash (which proves that importlib has good coverage :-)): replace "s:" by "U:" in if (!PyArg_ParseTuple(args, "s:init_builtin", &name)). |
test_reprlib fails on Windows, because '\' is quoted as '\\' in the filename in repr(module). Workaround: *******
index b0dc4d7..e476941 100644
--- a/Lib/test/test_reprlib.py
+++ b/Lib/test/test_reprlib.py
@@ -234,7 +234,7 @@ class LongReprTest(unittest.TestCase):
touch(os.path.join(self.subpkgname, self.pkgname + '.py'))
from areallylongpackageandmodulenametotestreprtruncation.areallylongpackageandmodulenametotestreprtruncation import areallylongpackageandmodulenametotestreprtruncation
eq(repr(areallylongpackageandmodulenametotestreprtruncation),
- "<module '%s' from '%s'>" % (areallylongpackageandmodulenametotestreprtruncation.__name__, areallylongpackageandmodulenametotestreprtruncation.__file__))
+ "<module %r from %r>" % (areallylongpackageandmodulenametotestreprtruncation.__name__, areallylongpackageandmodulenametotestreprtruncation.__file__))
eq(repr(sys), "<module 'sys' (built-in)>")
def test_type(self):
******* It is maybe not a good idea to use %R to format the filename in module.__repr__(). |
test_runpy fails on Windows on make_legacy_pyc() (of test.support), I don't know why. |
After applying the patch, doing a make clean and rebuild, I found that test_importlib fails with a segmentation fault, but the default test suite otherwise runs without error (that's on Linux with a UTF-8 filesystem, though). I'll see how a -uall run fares. |
As for the more limited run, I get a clean run with -uall except for the segfault in test_importlib. I'll switch to a pydebug build and see how a verbose run of that test fares. |
I haven't investigated in detail yet, but this is the final line showing the failing test: test_module (importlib.test.builtin.test_loader.LoaderTests) ... Segmentation fault |
Yes, as reported in my previous comment :-) Let's update the patch for practical reasons. But I don't want to touch http://codereview.appspot.com/1874048 (based on patch version 4). |
Oops, missed that post - that was indeed the problem. With that fixed, tests are all good on this system. I'll give the patch a look anyway, but I'm going to have trouble diagnosing things that don't fail on my development machine. As far as the test_reprlib failure goes, I seem to recall addressing a similar problem elsewhere in the standard lib by replacing a "%r" code with "'%s'" to get the single quotes without the backslash escaping. A similar change should probably do the trick here. |
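The test_reprlib difference comes down to how a Windows path is formatted. A small sketch with a made-up path shows why "%r" and "'%s'" disagree as soon as the file name contains a backslash:

```python
# Hypothetical Windows path; each "\\" in the literal is one backslash.
path = "C:\\build\\mod.py"

with_repr = "<module %r from %r>" % ("mod", path)
with_quotes = "<module '%s' from '%s'>" % ("mod", path)

print(with_repr)    # backslashes are doubled by repr()
print(with_quotes)  # backslashes are left untouched
```

On Linux paths the two forms are identical, which is consistent with the failure showing up only on the Windows buildbots.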
I started to commit some parts of the huge patch: r88515: Mark PyWin_FindRegisteredModule() as private |
r88519: Mark _PyImport_FindBuiltin() argument as constant |
This new failure is perhaps related: ====================================================================== Traceback (most recent call last):
File "c:\buildslave-py3k\3.x.curtin-win2008-amd64\build\lib\test\test_reprlib.py", line 237, in test_module
"<module '%s' from '%s'>" % (areallylongpackageandmodulenametotestreprtruncation.__name__, areallylongpackageandmodulenametotestreprtruncation.__file__))
AssertionError: "<module 'areallylongpackageandmodulenametotestreprtruncation.areallylongpackage [truncated]... != "<module 'areallylongpackageandmodulenametotestreprtruncation.areallylongpackage [truncated]...
Diff is 825 characters long. Set self.maxDiff to None to see it. |
Ah yes, yesterday, I tried to remember which test was impacted by the module change, but all tests passed on Linux. Anyway, it's now fixed by r88533. |
r88746: Add PyModule_NewObject() function |
I created the features/unicode_import repository with a "unicode_import" branch: it's my huge patch split into small and atomic commits. |
Nice work! Is there a specific place for comments? Here are some of them already:
pathsize = PyUnicode_GET_SIZE(prefix) + PyUnicode_GET_SIZE(name);
result = PyUnicode_FromUnicode(NULL, pathsize);
path = PyUnicode_AS_UNICODE(result);
...
return result;
lastdot = Py_UNICODE_strrchr(nameuni, '.');
if (lastdot == NULL)
shortname = nameuni;
else
shortname = lastdot + 1;
|
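For readers less familiar with the C string API, the second fragment above (take the part of a dotted module name after the last dot, or the whole name if there is no dot) corresponds roughly to the following Python. The function name is invented for this sketch.

```python
def short_name(fullname: str) -> str:
    """Return the last component of a dotted module name."""
    # rpartition returns ('', '', fullname) when there is no dot,
    # so the last element is the whole name in that case.
    _, _, last = fullname.rpartition(".")
    return last


print(short_name("package.subpackage.module"))
print(short_name("sys"))
```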
Yes, but my work is not done. I still have parts to commit.
Oh. This one is not easy because this function has many implementations and all implementations have the same prototype. I will maybe fix it later. |
I finished splitting the huge patch into smaller commits. You can now test the unicode_import Mercurial branch. In particular, it should be tested on Windows. I don't know if I should merge the branch as a single commit or as multiple commits. Some of them can simply be merged. You can try bpo-3080.py (file attached to this issue, extracted from the patch): a short script testing this issue. -- The parser and _PyImport_GetDynLoadFunc() (on Windows) still store the filename as byte strings, and so I don't think that Python is ready to use the full Unicode range for filenames on Windows. But at least, it should now support non-ASCII module names and paths which are encodable to the ANSI code page. Issue bpo-10785 should improve the situation at least for the parser. But for _PyImport_GetDynLoadFunc(), I don't know if there is a Unicode version of LoadLibraryEx(). --
Implemented in f286d3b514e0.
Done in 76907d413b99 |
I reverted this change in my Mercurial branch (unicode_import).
done
done
done
done: find_module_path_list() and find_module_path() |
Gotcha: I replaced mkdir() by CreateDirectoryW(), but the "directory already exists" error was not ignored. Fixed by 2debe178697b. |
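The gotcha above is a general pattern: creating a directory that may already exist has to treat "already exists" as success. A Python sketch of that pattern follows; the helper name and the directory name are assumptions for illustration, not taken from the changeset.

```python
import os
import tempfile


def ensure_directory(path: str) -> None:
    """Create path, ignoring the error raised when it already exists."""
    try:
        os.mkdir(path)
    except FileExistsError:
        # Only ignore the error if the existing entry really is a directory.
        if not os.path.isdir(path):
            raise


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = os.path.join(tmp, "__pycache__")  # hypothetical target
    ensure_directory(cache_dir)
    ensure_directory(cache_dir)  # second call must not fail
    created = os.path.isdir(cache_dir)

print(created)
```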
New changeset 6c80ac44ae9c by Victor Stinner in branch 'default':
New changeset b50a0d44545a by Victor Stinner in branch 'default':
New changeset e7c1019b27b9 by Victor Stinner in branch 'default':
New changeset 2425717c6430 by Victor Stinner in branch 'default':
New changeset ced52fcd95f6 by Victor Stinner in branch 'default':
New changeset e63a583ec689 by Victor Stinner in branch 'default':
New changeset bab42673674a by Victor Stinner in branch 'default':
New changeset ef2b6305d395 by Victor Stinner in branch 'default':
New changeset d52f471fbbeb by Victor Stinner in branch 'default':
New changeset bdf5820f5a39 by Victor Stinner in branch 'default':
New changeset a4d797b9ff63 by Victor Stinner in branch 'default':
New changeset 09aaac73d9cf by Victor Stinner in branch 'default':
New changeset f6507eb8e689 by Victor Stinner in branch 'default':
New changeset d24decc8c97e by Victor Stinner in branch 'default':
New changeset 64c21f364519 by Victor Stinner in branch 'default':
New changeset e55e7f197649 by Victor Stinner in branch 'default':
New changeset 7c67aa3ab531 by Victor Stinner in branch 'default':
New changeset 23fe237afa81 by Victor Stinner in branch 'default':
New changeset 2ee0ab9d2e8a by Victor Stinner in branch 'default':
New changeset 340f76a6a792 by Victor Stinner in branch 'default':
New changeset 156818529636 by Victor Stinner in branch 'default':
New changeset fe1d421ca3fa by Victor Stinner in branch 'default':
New changeset c1a5a7dca1ec by Victor Stinner in branch 'default':
New changeset c4ccf02456d6 by Victor Stinner in branch 'default':
New changeset 298a70b27497 by Victor Stinner in branch 'default':
New changeset 066b399a8477 by Victor Stinner in branch 'default':
New changeset 9aec6f0e4076 by Victor Stinner in branch 'default':
New changeset c17bc2026145 by Victor Stinner in branch 'default':
New changeset c4361bab6914 by Victor Stinner in branch 'default':
New changeset 80f4bd647695 by Victor Stinner in branch 'default':
New changeset cc7c0f6f60bf by Victor Stinner in branch 'default': |
New changeset f8d6f6797909 by Victor Stinner in branch 'default': |
New changeset dc38c4d65cd9 by Victor Stinner in branch 'default': |
http://www.python.org/dev/buildbot/all/builders/PPC%20Tiger%203.x/builds/1599/steps/test/logs/stdio ====================================================================== Traceback (most recent call last):
File "/Users/buildbot/buildarea/3.x.parc-tiger-1/build/Lib/test/test_importhooks.py", line 239, in testImpWrapper
m = __import__(mname, globals(), locals(), ["__dummy__"])
File "/Users/buildbot/buildarea/3.x.parc-tiger-1/build/Lib/test/test_importhooks.py", line 132, in load_module
mod = imp.load_module(fullname, self.file, self.filename, self.stuff)
File "/Users/buildbot/buildarea/3.x.parc-tiger-1/build/Lib/distutils/core.py", line 19, in <module>
from distutils.cmd import Command
File "/Users/buildbot/buildarea/3.x.parc-tiger-1/build/Lib/test/test_importhooks.py", line 132, in load_module
mod = imp.load_module(fullname, self.file, self.filename, self.stuff)
File "/Users/buildbot/buildarea/3.x.parc-tiger-1/build/Lib/distutils/cmd.py", line 11, in <module>
from distutils import util, dir_util, file_util, archive_util, dep_util
File "/Users/buildbot/buildarea/3.x.parc-tiger-1/build/Lib/test/test_importhooks.py", line 132, in load_module
mod = imp.load_module(fullname, self.file, self.filename, self.stuff)
File "/Users/buildbot/buildarea/3.x.parc-tiger-1/build/Lib/distutils/dir_util.py", line 8, in <module>
import errno
File "/Users/buildbot/buildarea/3.x.parc-tiger-1/build/Lib/test/test_importhooks.py", line 132, in load_module
mod = imp.load_module(fullname, self.file, self.filename, self.stuff)
TypeError: 'NoneType' object is not iterable |
The problem is that imp.find_module() now returns None as the filename, but imp.load_module() doesn't support None. |
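The same situation (a module with no file behind it) still has to be handled by code that consumes import metadata. The imp module was removed in Python 3.12, so the sketch below uses importlib as a modern stand-in rather than the imp API discussed above; the describe() helper is invented for illustration.

```python
import importlib.util


def describe(module_name: str):
    """Return (name, filename), with filename None for modules without a file."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return module_name, None
    # Built-in modules have no location, so treat their origin as "no file"
    # instead of passing a non-path value on to path-handling code.
    filename = spec.origin if spec.has_location else None
    return module_name, filename


print(describe("sys"))   # built-in: no filename
print(describe("json"))  # regular package: path to its __init__.py
```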
Attached patch fixes a typo in Doc/c-api/import.rst. You can merge it in your next commit. |
New changeset 7f4a4e393058 by Victor Stinner in branch 'default': |
Ok. Python 3.3 does now support non-ASCII characters in module paths and names on Windows, but only characters encodable to the ANSI code page. To support the full Unicode range, we should remove all calls to PyUnicode_EncodeFSDefault() on Windows:
a) parse_source_module() has to encode the filename because the parser has no function expecting a filename as a Python object. It currently uses PyParser_ASTFromFile().
b) write_compiled_module() encodes the filename to call open_exclusive(). I don't know how to implement open_exclusive() for Windows using a Unicode filename: open() expects the filename as a byte string. Can we use _Py_fopen() (_wfopen)? Or do you need the O_EXCL flag?
c) _PyImport_LoadDynamicModule() encodes the filename for _PyImport_GetDynLoadFunc(). The prototype should be changed, but only on Windows, to accept a filename as a Unicode string.
Issue bpo-10785 is the right fix for (a). Once bpo-10785 is fixed, it will be easier to fix the bpo-9319 crash. |
Hum, the difficult part is to use Unicode in _PyImport_GetDynLoadFunc() for: hDLL = LoadLibraryEx(pathname, NULL, LOAD_WITH_ALTERED_SEARCH_PATH); There is a LoadLibraryW() function, but it doesn't have a flag argument. And I suppose that the LOAD_WITH_ALTERED_SEARCH_PATH option is important. |
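For reference, the Windows API also provides LoadLibraryExW(), the wide-character variant of LoadLibraryEx(), which takes the same flags argument. The ctypes sketch below shows how it could be called with a Unicode path; it is an illustration from Python, not the change needed in the C code of _PyImport_GetDynLoadFunc(), and the helper name is made up. It does nothing on platforms other than Windows.

```python
import sys

# Value of the flag as defined in the Windows SDK headers.
LOAD_WITH_ALTERED_SEARCH_PATH = 0x00000008


def load_library_unicode(pathname: str):
    """Load a DLL from a Unicode path; return None when not on Windows."""
    if sys.platform != "win32":
        return None
    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    load = kernel32.LoadLibraryExW
    load.argtypes = [wintypes.LPCWSTR, wintypes.HANDLE, wintypes.DWORD]
    load.restype = wintypes.HMODULE
    # pathname should be an absolute path when this flag is used.
    handle = load(pathname, None, LOAD_WITH_ALTERED_SEARCH_PATH)
    if not handle:
        raise ctypes.WinError(ctypes.get_last_error())
    return handle
```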
Ok, I think that the most important part is now implemented in Python 3.3: use Unicode for module names and paths in the import machinery. The remaining parts are specific to Windows, and so I opened a new issue: bpo-11619. Let's close this 3-year-old issue. |
New changeset ee4e780a6b7a by Éric Araujo in branch 'default': |
As I see Victor has dropped OS/2 support from Python/import.c |
340f76a6a792 just removes a few lines in import.c: they can easily be rewritten. And this commit doesn't completely drop OS/2 support from the import machinery, as you wrote: dynload_os2.c still exists. If we completely drop OS/2 support, it should be done properly using a PEP (I don't remember its number), and it should be discussed. At least with Andrew I MacIntyre :-) |
Understood. Sorry. Anyway please don't care about that. |
New changeset 15f9eca5e956 by Victor Stinner in branch 'default': |
test the fixed nosy list |