Issue 4352: imp.find_module() fails with a UnicodeDecodeError when called with non-ASCII search paths



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/48602

classification

Title:	imp.find_module() fails with a UnicodeDecodeError when called with non-ASCII search paths
Type:	behavior	Stage:
Components:	Interpreter Core, Library (Lib), Unicode, Windows	Versions:	Python 3.0, Python 3.1

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	Jukka Aho, Serg.Asminog, amaury.forgeotdarc, asvetlov, benjamin.peterson, gvanrossum, vstinner
Priority:	normal	Keywords:

Created on 2008-11-19 05:17 by Jukka Aho, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
find_module.py	Jukka Aho, 2008-11-19 05:17
test.py	Serg.Asminog, 2011-12-09 12:05

Messages (20)
msg76038 - (view)	Author: Jukka Aho (Jukka Aho)	Date: 2008-11-19 05:17
imp.find_module() seems to raise a UnicodeDecodeError when the path list contains paths with non-ASCII names. Tested on Windows [1]; the attached test case demonstrates the problem.
[1] Python 3.0rc2 (r30rc2:67141, Nov 7 2008, 11:43:46) [MSC v.1500 32 bit (Intel)] on win32
msg76045 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-11-19 13:13
The example works correctly on Linux (py3k trunk). Maybe the problem is specific to Windows?
msg76048 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2008-11-19 13:52
Indeed. It happens when the filesystem encoding is not utf-8. I have several changes in my local workspace about this, which also deal with zipimport and other places that import modules. I suggest letting 3.0 go out and correcting all this for 3.1.
msg83825 - (view)	Author: STINNER Victor (vstinner) *	Date: 2009-03-19 22:59
> Indeed. It happens when the filesystem encoding is not utf-8.
How can I test it on Linux?
msg83828 - (view)	Author: STINNER Victor (vstinner) *	Date: 2009-03-19 23:06
Oh, I found sys.setfilesystemencoding("latin-1")! But even with that, your example find_module.py works correctly with py3k trunk. Maybe the problem is gone?
msg83829 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2009-03-19 23:10
Well, latin-1 can decode any arbitrary array of bytes, so of course it won't fail. :)
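The point about latin-1 can be checked directly. The following is a minimal sketch, not taken from the thread: latin-1 assigns a code point to every byte value, so decoding never fails, whereas a stricter codec such as UTF-8 rejects a lone 0xE4 byte. All variable names are illustrative.

```python
# latin-1 maps each byte 0x00-0xFF to the code point with the same number,
# so decoding arbitrary bytes can never raise.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
round_trips = decoded.encode("latin-1") == all_bytes

# UTF-8, by contrast, rejects a lone 0xE4 byte (latin-1 for a-umlaut).
try:
    b"h\xe4kkinen".decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(round_trips, utf8_failed)
```

This is why switching the filesystem encoding to latin-1 is unlikely to reproduce the failure: a codec that cannot fail hides the problem rather than exercising it.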
msg84459 - (view)	Author: Andrew Svetlov (asvetlov) *	Date: 2009-03-30 02:28
I can reproduce this problem on Windows Vista with fresh py3k sources. It looks like the bug occurs only with Latin-1 characters. At least Cyrillic works OK.
msg84512 - (view)	Author: Andrew Svetlov (asvetlov) *	Date: 2009-03-30 05:43
From my understanding (after tracing and debugging), the problem lies in import.c: find_module tries to convert the path from unicode to a bytestring using Py_FileSystemDefaultEncoding (line 1397). On Windows this is 'mbcs'. The conversion is done with decode_mbcs (unicodeobject.c:4244), which uses MultiByteToWideChar with codepage CP_ACP.

The problem is that when converting composite characters ('\u00e4' is 'a' + 'two dots over the letter'; I don't know the true name for this sign), this function returns only 'a'.

>>> repr('h\u00e4kkinen'.encode('mbcs'))
"b'hakkinen'"

MSDN says (http://msdn.microsoft.com/en-us/library/dd374130(VS.85).aspx):

> For strings that require validation, such as file, resource, and user names, the application should always use the WC_NO_BEST_FIT_CHARS flag with WideCharToMultiByte. This flag prevents the function from mapping characters to characters that appear similar but have very different semantics. In some cases, the semantic change can be extreme. For example, the symbol for "∞" (infinity) maps to 8 (eight) in some code pages.

Writing an encoding function as the counterpart of PyUnicode_DecodeFSDefault that sets this flag does not help either: the problematic character is simply replaced with the default character ('?' if not specified).

Special-casing the 'latin-1' encoding sounds ugly. Changing all filenames to unicode in import.c (possibly using fileio instead of direct calls to open/fdopen) looks good to me, but it would take a long time and require many changes.
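The lossy mapping described above can be imitated in portable Python. This is only a sketch under stated assumptions: it does not call the Windows API and is not how WideCharToMultiByte implements best-fit mapping. It decomposes the string and drops combining marks to show how 'ä' can silently become 'a'. The helper name strip_marks is hypothetical, and cp1251 stands in for a Russian ANSI code page because the mbcs codec exists only on Windows.

```python
import unicodedata


def strip_marks(text):
    """Decompose to NFD and drop combining marks (a lossy, best-fit-like mapping)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


name = "h\u00e4kkinen"

# The code points that a-umlaut decomposes into: base letter + combining mark.
parts = unicodedata.decomposition("\u00e4")

# The official Unicode name of the "two dots over the letter" sign.
mark_name = unicodedata.name("\u0308")

# After the lossy step the name encodes without error, but it is a different name.
lossy = strip_marks(name).encode("cp1251")

print(parts, mark_name, lossy)
```

A path converted this way no longer matches the directory on disk, which is consistent with the lookup failing even though no encoding error is raised.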
msg84547 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2009-03-30 14:21
At the sprint, Andrew Svetlov, Martin von Loewis and I looked into this a bit, and discovered that Andrew's Vista copy uses a Russian locale for the filesystem encoding (despite using English as the language). In this locale, a-umlaut cannot be represented in the ANSI code page (which has only 256 values), because the Russian locale uses those byte values to represent Cyrillic.

As long as the import code (written in C) uses bytes in the filesystem encoding to represent paths, this problem will remain. Two possible solutions would be to switch to Brett's importlib, or to change the import code to use wide characters everywhere (like posixmodule.c). Both are extremely risky and a lot of work, and I don't expect we'll get to this for 3.1. (In 2.x the same problem exists, but is perhaps less real because module names are limited to ASCII.)

We also discovered another problem, which I'll report separately: the module name is decoded to UTF-8, while the path name uses the filesystem encoding...
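The code page conflict can be shown with Python's built-in codecs. This sketch assumes cp1252 and cp1251 as examples of a Western European and a Russian ANSI code page; the thread does not name the exact code page on Andrew's machine at this point, so those choices are illustrative.

```python
A_UMLAUT = "\u00e4"

# Western European ANSI code page: a-umlaut is the single byte 0xE4.
western = A_UMLAUT.encode("cp1252")

# Russian ANSI code page: there is no a-umlaut, so strict encoding fails.
try:
    A_UMLAUT.encode("cp1251")
    russian_ok = True
except UnicodeEncodeError:
    russian_ok = False

# The same byte value is taken by a Cyrillic letter in cp1251.
same_byte_in_russian = western.decode("cp1251")

print(western, russian_ok, same_byte_in_russian)
```

With only 256 byte values available, a code page that spends the upper range on Cyrillic has no slot left for Latin letters with diacritics, so any bytes-based path representation has to lose information.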
msg106096 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-05-19 20:45
See also #8611.
msg107833 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-06-14 22:50
About the mbcs encoding: issue #850997 proposes to make it more strict.
msg108150 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-06-18 23:37
I closed issue #850997; mbcs is now really strict by default:

>>> 'h\u00e4kkinen'.encode('mbcs')
UnicodeEncodeError: ...
>>> 'h\u00e4kkinen'.encode('mbcs', 'replace')
"b'hakkinen'"

PyUnicode_EncodeFSDefault(), PyUnicode_DecodeFSDefault() and os.fsencode() use mbcs with the strict error handler on Windows. On other OSes, these functions use the surrogateescape error handler, but mbcs only supports strict and replace for encoding, and strict and ignore for decoding.
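The difference between the strict and surrogateescape error handlers can be sketched as follows. Because the mbcs codec is available only on Windows, this example uses the UTF-8 codec to illustrate the handlers; it is not a reproduction of the Windows behaviour described above.

```python
raw = b"h\xe4kkinen"  # latin-1 bytes, which are not valid UTF-8

# strict: undecodable input raises immediately.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape: the bad byte 0xE4 is smuggled through as lone surrogate U+DCE4,
escaped = raw.decode("utf-8", "surrogateescape")

# and encoding with the same handler restores the original bytes exactly.
restored = escaped.encode("utf-8", "surrogateescape")

print(strict_ok, ascii(escaped), restored == raw)
```

The lossless round trip is what lets non-Windows platforms carry arbitrary byte filenames through str, while a strict handler surfaces the problem as an exception instead.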
msg112030 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-07-30 00:22
I wrote a patch to fix this issue, see #9425.
msg118981 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-10-17 20:37
Good news: this issue is now fixed in py3k (Python 3.2). I cannot give a commit number, because there are too many commits related to this problem (see #8611 and #9425), but it works ;-)
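The attached find_module.py is not reproduced in this thread. As a rough stand-in, the sketch below checks the same scenario, a module located in a search directory with a non-ASCII name. It uses importlib.machinery.PathFinder rather than imp.find_module(), because the imp module was removed in Python 3.12. The function name, the directory name and the module contents are assumptions made for illustration, and the check presumes the local filesystem can store the chosen directory name.

```python
import importlib.machinery
import os
import tempfile


def find_in_non_ascii_dir(dir_name="h\u00e4kkinen", module_name="mymodule"):
    """Create a module inside a non-ASCII directory and try to locate it."""
    with tempfile.TemporaryDirectory() as root:
        search_dir = os.path.join(root, dir_name)
        os.mkdir(search_dir)
        module_path = os.path.join(search_dir, module_name + ".py")
        with open(module_path, "w", encoding="utf-8") as f:
            f.write("VALUE = 42\n")
        # Search only the given directory, as imp.find_module(name, [path]) did.
        spec = importlib.machinery.PathFinder.find_spec(module_name, [search_dir])
        return None if spec is None else os.path.basename(spec.origin)


print(find_in_non_ascii_dir())
```

A return value of None would mean the finder did not locate the module in the non-ASCII directory.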
msg149087 - (view)	Author: Serg Asminog (Serg.Asminog)	Date: 2011-12-09 12:05
dirname = 'A-Za-z\xc4\xd6\xdc\xe4\xf6\xfc\xdf'

Traceback (most recent call last):
  File "D:\temp\python bug\test.py", line 19, in <module>
    file_object, file_path, description = imp.find_module(basename, [dirname])
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
msg149088 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-12-09 12:58
@Serg Asminog: What is your Python version? What is your locale encoding (print(sys.getfilesystemencoding()))? What is your Windows version?
msg149097 - (view)	Author: Serg Asminog (Serg.Asminog)	Date: 2011-12-09 14:11
print(sys.getfilesystemencoding())
print(os.name)
print(sys.version)
print(sys.version_info)
print(sys.platform)
-----
mbcs
nt
3.2.2 (default, Sep 4 2011, 09:07:29) [MSC v.1500 64 bit (AMD64)]
sys.version_info(major=3, minor=2, micro=2, releaselevel='final', serial=0)
win32
-----------
Windows 7 64bit
msg149098 - (view)	Author: Serg Asminog (Serg.Asminog)	Date: 2011-12-09 14:16
Also

Traceback (most recent call last):
  File "D:\temp\python bug\test.py", line 20, in <module>
    file_object, file_path, description = imp.find_module(basename, [dirname])
ImportError: No module named mymodule

with python 2.6.6 (r266:84297, Aug 24 2010, 18:13:38) [MSC v.1500 64 bit (AMD64)]
msg149099 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-12-09 14:18
Oops, it's not sys.getfilesystemencoding(), but locale.getpreferredencoding() which is interesting. Can you give me your locale encoding?
msg149102 - (view)	Author: Serg Asminog (Serg.Asminog)	Date: 2011-12-09 14:55
cp1251
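With the locale encoding known, the reported dirname can be checked against it. The sketch below is an illustration, not part of the attached test.py: the helper unencodable_chars is hypothetical, and it reports which characters of a path a given codec cannot represent.

```python
def unencodable_chars(path, encoding):
    """Return (index, character) pairs that `encoding` cannot represent."""
    bad = []
    for index, ch in enumerate(path):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append((index, ch))
    return bad


# The directory name from the report above.
dirname = 'A-Za-z\xc4\xd6\xdc\xe4\xf6\xfc\xdf'

problems = unencodable_chars(dirname, "cp1251")
for index, ch in problems:
    print(index, ascii(ch))
```

On a live system, locale.getpreferredencoding(False) could supply the encoding argument. Under cp1251 every accented Latin letter in this name is reported, while cp1252 reports none, which fits the failure appearing only on the Russian-locale machine.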

History
Date	User	Action	Args
2022-04-11 14:56:41	admin	set	github: 48602
2011-12-09 14:55:44	Serg.Asminog	set	messages: + msg149102
2011-12-09 14:18:36	vstinner	set	messages: + msg149099
2011-12-09 14:16:17	Serg.Asminog	set	messages: + msg149098
2011-12-09 14:11:51	Serg.Asminog	set	messages: + msg149097
2011-12-09 12:58:05	vstinner	set	messages: + msg149088
2011-12-09 12:05:12	Serg.Asminog	set	files: + test.py nosy: + Serg.Asminog messages: + msg149087
2010-10-17 20:37:08	vstinner	set	status: open -> closed resolution: fixed messages: + msg118981
2010-07-30 00:22:42	vstinner	set	messages: + msg112030
2010-06-18 23:37:42	vstinner	set	messages: + msg108150
2010-06-14 22:50:03	vstinner	set	messages: + msg107833
2010-05-19 20:45:45	vstinner	set	messages: + msg106096
2009-03-30 14:21:42	gvanrossum	set	nosy: + gvanrossum messages: + msg84547
2009-03-30 05:43:30	asvetlov	set	messages: + msg84512 components: + Interpreter Core versions: + Python 3.1
2009-03-30 02:28:54	asvetlov	set	nosy: + asvetlov messages: + msg84459
2009-03-19 23:10:51	benjamin.peterson	set	nosy: + benjamin.peterson messages: + msg83829
2009-03-19 23:06:54	vstinner	set	messages: + msg83828
2009-03-19 23:00:00	vstinner	set	messages: + msg83825
2008-11-19 13:52:16	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg76048
2008-11-19 13:13:17	vstinner	set	nosy: + vstinner messages: + msg76045
2008-11-19 05:42:31	Jukka Aho	set	title: imp.find_module() causes UnicodeDecodeError with non-ASCII search paths -> imp.find_module() fails with a UnicodeDecodeError when called with non-ASCII search paths
2008-11-19 05:17:45	Jukka Aho	create