Issue 5604: imp.find_module() mixes UTF8 and MBCS

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/49854

classification

Title:	imp.find_module() mixes UTF8 and MBCS
Type:	behavior	Stage:	resolved
Components:	Interpreter Core	Versions:	Python 3.1, Python 3.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	ezio.melotti	Nosy List:	asvetlov, brett.cannon, ezio.melotti, flox, gvanrossum, loewis, ncoghlan, ocean-city
Priority:	normal	Keywords:	buildbot

Created on 2009-03-30 14:26 by gvanrossum, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
import.zip	asvetlov, 2009-03-30 19:48	patch for support non-ascii characters in module names. With unittest
import.zip	asvetlov, 2009-03-30 20:42
import.zip	asvetlov, 2009-03-31 19:07	updated patch for fixing error messages in generated exception
import_patch_4th_edition.zip	asvetlov, 2009-04-05 04:58	DONT APPLY IT. Works only for windows now.

Messages (15)
msg84548 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2009-03-30 14:26
There's a path in imp.find_module that mixes encodings. The module name is encoded to char* using UTF-8 by the 's' format passed to PyArg_ParseTuple(). But the path name is converted (in the loop over the path in find_module()) to char* using the filesystem encoding. On Windows this ends up constructing a char* that mixes MBCS and UTF8 in one string. (We discovered this when researching bug 4352, but this is not the cause of the problem reported there -- the module name in that bug is ASCII.) Andrew Svetlov is looking into producing a patch.
msg84626 - (view)	Author: Andrew Svetlov (asvetlov) *	Date: 2009-03-30 19:48
Problem fixed, patch attached I inserted conversion path parameters to using Py_FileSystemDefaultEncoding for: * load_module * load_compiled * load_dynamic * load_source * load_package find_module is already has conversion.
msg84634 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2009-03-30 20:13
PyMem_Free is needed when "es" is used with PyArg_ParseTuple. See other part of import.c. I did same mistake before. ;-)
msg84636 - (view)	Author: Andrew Svetlov (asvetlov) *	Date: 2009-03-30 20:20
Thank you. On Mon, Mar 30, 2009 at 3:13 PM, Hirokazu Yamamoto <report@bugs.python.org> wrote: > > Hirokazu Yamamoto <ocean-city@m2.ccsnet.ne.jp> added the comment: > > PyMem_Free is needed when "es" is used with PyArg_ParseTuple. See other > part of import.c. I did same mistake before. ;-) > > ---------- > nosy: +ocean-city > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue5604> > _______________________________________ >
msg84644 - (view)	Author: Andrew Svetlov (asvetlov) *	Date: 2009-03-30 20:42
According to Hirokazu Yamamoto memory cleanup added. Patch is updated.
msg84663 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2009-03-30 21:49
Thanks Andrew! Committed to 3.0.2 as 70756. Should be merged into 3.1, but should not be backported to 2.x.
msg84862 - (view)	Author: Andrew Svetlov (asvetlov) *	Date: 2009-03-31 19:07
Continuing work over import.c I fixed bad error message encoding in generated ImportError exception. Tests for checking in case of non-ascii characters added.
msg84865 - (view)	Author: Andrew Svetlov (asvetlov) *	Date: 2009-03-31 19:09
Martin von Loewis added to nosy list
msg85220 - (view)	Author: Andrew Svetlov (asvetlov) *	Date: 2009-04-02 16:27
Martin, can you review latest patch and apply it if this one is correct. I want to start working on conversion import.c to use unicode strings (we spoke about Tuesday) this weekend. It will be nice if I will have synchronized svn before making new changes.
msg85464 - (view)	Author: Andrew Svetlov (asvetlov) *	Date: 2009-04-05 04:58
Continuing work on problem I figured out: * on Windows it's impossible to convert filenames to file system encoding without and don't miss something. * Windows can work properly only with unicode (wchar_t) characters. * all other systems feels itself good using utf-8 (or another filesystem encoding). * it's very errorprone to change 'char' to 'PyUnicode'. To solve this problem I assume: * all char* in Python API is utf-8. * if there are need call to operation system api like fopen - call imp_fopen, this function will do need conversions. Inside import.c there are 4 calls: fopen, open_exclusive, unlink, stat. I want to write stubs for this calls. * also loaders for dynamic modules aka 'C extensions' have to expect utf-8 as pathname parameter, not 'filesystem encoded'. Patch for windows is applied (STILL NOT CONVERTED TO OTHER OS). But for Windows it works (regression tests passed). If this solution is applicable for 3.1 (as I know Cannon works on excellent importlib but this library will replace imp functionality only in 3.2) - I can continue switching. Unfortunately I cannot test py3k trunk on non-windows machines - but I can 'make all OS calls as expected' and wait for buildbot answer. Please review import_patch_4th_edition.zip and if I ran in wrong way - let me know.
msg99499 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-02-18 12:09
Still an issue for some buildbot: http://www.python.org/dev/buildbot/all/builders/x86%20XP-4%203.x/builds/1487 http://www.python.org/dev/buildbot/all/builders/x86%20XP-4%203.x/builds/1491 It is loosely related with #7712, because now the tests are run in TEMP (on C:) instead of the build dir (on D:).
msg99503 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-02-18 13:23
Also the test has a few problems: 1) the keys of known_locales are lowercase, but locale_encoding = locale.getpreferredencoding() can return uppercase encodings (e.g. UTF-8); 2) this masks another error: the b'\xe4' is not a valid utf-8 byte and it can be decoded; 3) the test should be skipped properly if the preferred encoding is not among the ones of the known_locales dict; 4) the 'encoded_char' should be 'decoded_char'. It seems that in the failure linked by Florent, find_module tries to encode the filename with the wrong encoding and with error='replace' and the char at the end of 'test_imp_helper_' is converted to U+FFFD. If the file is created with the correct name (e.g. 'test_imp_helper_\xc0') and find_module tries to search the wrong name (i.e. 'test_imp_helper_\ufffd'), then the error is raised (but then cp1252 used by the terminal can't encode that char and the second exception is raised). Now the tests are run on C: and the filesystem encoding might be different, so it might not match anymore the encoding returned by locale.getpreferredencoding().
msg100505 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-03-05 23:45
I fixed all the things listed in the previous message in r78689, but that just enabled the test on several Linux buildbots and some started to fail too. In r78696 (and r78697) I tried to use sys.getfilesystemencoding() instead of locale.getpreferredencoding() and that turned at least the Windows buildbots green, but there are still a few linux bots that fail: http://www.python.org/dev/buildbot/builders/AMD64%20Ubuntu%20wide%203.x/builds/545/steps/test/logs/stdio http://www.python.org/dev/buildbot/builders/AMD64%20Ubuntu%203.x/builds/561/steps/test/logs/stdio http://www.python.org/dev/buildbot/builders/x86%20Ubuntu%203.x/builds/525/steps/test/logs/stdio http://www.python.org/dev/buildbot/builders/amd64%20gentoo%203.x/builds/513/steps/test/logs/stdio
msg100528 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-03-06 14:50
Thanks for fixing this. Now Win7 buildbot is green on trunk.
msg100529 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-03-06 14:54
The Linux buildbots were running the tests using ./python ./Lib/test/regrtest.py instead of ./python -m test.regrtest and '' was missing from sys.path, so imp.find_module couldn't find the module. This is now fixed in r78711 and backported r78716. Thanks to Florent that noticed that those buildbots were using a different command to run the tests.

History
Date	User	Action	Args
2022-04-11 14:56:47	admin	set	github: 49854
2010-03-06 14:54:18	ezio.melotti	set	status: open -> closed assignee: ezio.melotti resolution: fixed messages: + msg100529
2010-03-06 14:50:28	flox	set	messages: + msg100528 stage: needs patch -> resolved
2010-03-05 23:45:40	ezio.melotti	set	nosy: + ncoghlan messages: + msg100505
2010-02-18 14:45:26	flox	set	versions: + Python 3.2, - Python 3.0
2010-02-18 13:24:48	ezio.melotti	set	status: closed -> open resolution: fixed -> (no value)
2010-02-18 13:23:16	ezio.melotti	set	status: open -> closed resolution: fixed messages: + msg99503
2010-02-18 12:09:46	flox	set	status: closed -> open nosy: + ezio.melotti, flox messages: + msg99499 keywords: + buildbot resolution: fixed -> (no value)
2009-04-05 04:59:09	asvetlov	set	files: + import_patch_4th_edition.zip nosy: + brett.cannon messages: + msg85464
2009-04-02 16:27:54	asvetlov	set	messages: + msg85220
2009-03-31 19:09:28	asvetlov	set	nosy: + loewis messages: + msg84865
2009-03-31 19:07:14	asvetlov	set	files: + import.zip messages: + msg84862
2009-03-30 21:49:01	gvanrossum	set	status: open -> closed resolution: fixed messages: + msg84663
2009-03-30 20:42:48	asvetlov	set	files: + import.zip messages: + msg84644
2009-03-30 20:20:42	asvetlov	set	messages: + msg84636
2009-03-30 20:13:02	ocean-city	set	nosy: + ocean-city messages: + msg84634
2009-03-30 19:48:45	asvetlov	set	files: + import.zip messages: + msg84626
2009-03-30 14:26:40	gvanrossum	create