classification
Title: imp.find_module() fails with a UnicodeDecodeError when called with non-ASCII search paths
Type: behavior Stage:
Components: Interpreter Core, Library (Lib), Unicode, Windows Versions: Python 3.0, Python 3.1
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Jukka Aho, Serg.Asminog, amaury.forgeotdarc, asvetlov, benjamin.peterson, gvanrossum, vstinner
Priority: normal Keywords:

Created on 2008-11-19 05:17 by Jukka Aho, last changed 2011-12-09 14:55 by Serg.Asminog. This issue is now closed.

Files
File name Uploaded Description Edit
find_module.py Jukka Aho, 2008-11-19 05:17
test.py Serg.Asminog, 2011-12-09 12:05
Messages (20)
msg76038 - (view) Author: Jukka Aho (Jukka Aho) Date: 2008-11-19 05:17
imp.find_module() seems to raise a UnicodeDecodeError when the path
list contains paths with non-ASCII names. Tested on Windows [1]; see the
attached test case, which demonstrates the problem.

[1] Python 3.0rc2 (r30rc2:67141, Nov  7 2008, 11:43:46) [MSC v.1500 32
bit (Intel)] on win32
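
The attached find_module.py is not reproduced on this page. A rough sketch of a similar check is given below, under several assumptions: it uses importlib.machinery.PathFinder instead of imp (which was removed in Python 3.12), and the helper name find_in_non_ascii_dir, the directory name and the module contents are hypothetical.

```python
import importlib.machinery
import os
import tempfile


def find_in_non_ascii_dir(module_name="mymodule", dir_name="h\u00e4kkinen"):
    """Create a module inside a non-ASCII directory and try to locate it.

    Returns the path reported by the import machinery, or None if the
    module was not found. All names here are illustrative only.
    """
    with tempfile.TemporaryDirectory() as root:
        search_dir = os.path.join(root, dir_name)
        os.mkdir(search_dir)
        # Write a trivial module so there is something to find.
        with open(os.path.join(search_dir, module_name + ".py"), "w") as f:
            f.write("VALUE = 42\n")
        # Search only the non-ASCII directory, like find_module(name, [dir]).
        spec = importlib.machinery.PathFinder.find_spec(module_name, [search_dir])
        return None if spec is None else spec.origin


print(find_in_non_ascii_dir())
```

On an affected interpreter the lookup itself would be expected to fail; on a fixed one it should return a path inside the non-ASCII directory.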
msg76045 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-11-19 13:13
The example works correctly on Linux (py3k trunk). Maybe the problem is
specific to Windows?
msg76048 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-11-19 13:52
Indeed. It happens when the filesystem encoding is not utf-8.

I have several changes in my local workspace about this, which also deal
with zipimport and other places that import modules.
I suggest letting 3.0 go out and correcting all this for 3.1.
msg83825 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-19 22:59
> Indeed. It happens when the filesystem encoding is not utf-8.

How can I test it on Linux?
msg83828 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-19 23:06
Oh, I found sys.setfilesystemencoding("latin-1")! But even with that,
your example find_module.py works correctly with py3k trunk. Maybe the
problem is gone?
msg83829 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-03-19 23:10
Well, latin-1 can decode any arbitrary array of bytes, so of course it
won't fail. :)
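
A small illustration of this point, as a sketch only (the helper can_decode is a hypothetical name): latin-1 maps every byte value to a code point, so decoding never raises, whereas a stricter codec such as utf-8 rejects the same bytes.

```python
def can_decode(data, encoding):
    """Return True if data decodes cleanly with the given codec."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Every possible byte value round-trips through latin-1.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes

raw = b"h\xe4kkinen"  # 'hakkinen' with a-umlaut, encoded as latin-1
print(can_decode(raw, "latin-1"))
print(can_decode(raw, "utf-8"))
```

This is why switching the filesystem encoding to latin-1 cannot trigger a decoding failure, whatever the bytes are.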
msg84459 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2009-03-30 02:28
I can reproduce this problem on Windows Vista with fresh py3k sources.
It looks like the bug occurs only with Latin-1 characters.
At least Cyrillic works OK.
msg84512 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2009-03-30 05:43
From my understanding (after tracing and debugging), the problem lies in import.c:
find_module tries to convert the path from unicode to a bytestring using Py_FileSystemDefaultEncoding (line 1397). On Windows it is 'mbcs'.

The conversion is done with decode_mbcs (unicodeobject.c:4244), which uses MultiByteToWideChar with the code page CP_ACP. The problem is that when converting
composite characters ('\u00e4' is 'a' plus two dots over the letter; I don't know
the proper name for this sign), this function returns only 'a'.

>>> repr('h\u00e4kkinen'.encode('mbcs'))
"b'hakkinen'"

MSDN says (http://msdn.microsoft.com/en-us/library/dd374130(VS.85).aspx):
For strings that require validation, such as file, resource, and user 
names, the application should always use the WC_NO_BEST_FIT_CHARS flag 
with WideCharToMultiByte. This flag prevents the function from mapping 
characters to characters that appear similar but have very different 
semantics. In some cases, the semantic change can be extreme. For 
example, the symbol for "∞" (infinity) maps to 8 (eight) in some code 
pages.

Writing an encoding counterpart to PyUnicode_DecodeFSDefault that sets
this flag does not help either: the problematic character is just replaced
with the default character ('?' if not specified).
Special-casing the 'latin-1' encoding sounds ugly.

Changing all filenames to unicode in import.c (possibly using fileio instead
of direct calls to open/fdopen) looks good to me, but it would take a
long time and require many changes.
msg84547 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2009-03-30 14:21
At the sprint, Andrew Svetlov, Martin von Loewis and I looked into this
a bit, and discovered that Andrew's Vista copy uses a Russian locale for
the filesystem encoding (despite using English as the language).  In
this locale, a-umlaut cannot be represented in the ANSI code page (which
has only 256 values), because the Russian locale uses those byte values
to represent Cyrillic.

As long as the import code (written in C) uses bytes in the filesystem
encoding to represent paths, this problem will remain.
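
The limitation can be illustrated without Windows, as a sketch only: here Python's cp1251 codec stands in for a Russian ANSI code page and cp1252 for a Western European one (both are assumptions, since the thread itself only names 'mbcs'), and try_encode is a hypothetical helper.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


name = "h\u00e4kkinen"

# A Western European code page has a slot for a-umlaut.
print(try_encode(name, "cp1252"))

# A Cyrillic code page does not: the same byte value is used for a
# Cyrillic letter, so strict encoding fails.
print(try_encode(name, "cp1251"))
print(b"\xe4".decode("cp1251"))

# With the 'replace' handler the character is lost, so the resulting
# byte path no longer names the original directory.
print(name.encode("cp1251", "replace"))
```

Either way, a byte string in such a code page cannot carry the original directory name, which is the reason a bytes-based import path keeps failing.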

Two possible solutions would be to switch to Brett's importlib, or to
change the import code to use wide characters everywhere (like
posixmodule.c).  Both are extremely risky and a lot of work, and I don't
expect we'll get to this for 3.1.

(In 2.x the same problem exists, but is perhaps less real because module
names are limited to ASCII.)

We also discovered another problem, which I'll report separately: the
*module* name is decoded to UTF8, while the *path* name uses the
filesystem encoding...
msg106096 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-19 20:45
See also #8611.
msg107833 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-14 22:50
About the mbcs encoding: issue #850997 proposes making it stricter.
msg108150 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-18 23:37
I closed issue #850997; mbcs is now really strict by default:

>>> 'h\u00e4kkinen'.encode('mbcs')
UnicodeEncodeError: ...
>>> 'h\u00e4kkinen'.encode('mbcs', 'replace')
b'hakkinen'

PyUnicode_EncodeFSDefault(), PyUnicode_DecodeFSDefault() and os.fsencode() use mbcs with the strict error handler on Windows. On other operating systems, these functions use the surrogateescape error handler, but mbcs supports only strict and replace for encoding, and only strict and ignore for decoding.
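
For comparison, a minimal sketch of what the surrogateescape handler does on non-Windows systems. It assumes utf-8 as the filesystem encoding and calls the codec directly rather than os.fsencode(), so that the result does not depend on the machine's locale; the helper names roundtrip and strict_decode_fails are hypothetical.

```python
def roundtrip(raw, encoding="utf-8"):
    """Decode undecodable bytes to lone surrogates, then encode them back."""
    text = raw.decode(encoding, "surrogateescape")
    return text, text.encode(encoding, "surrogateescape")


def strict_decode_fails(raw, encoding="utf-8"):
    """Return True if the strict handler rejects the bytes."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return True
    return False


raw = b"h\xe4kkinen"  # not valid utf-8
text, back = roundtrip(raw)
print(ascii(text))  # the bad byte is kept as a lone surrogate
print(back == raw)
print(strict_decode_fails(raw))
```

The undecodable byte survives the round trip, so a file name that is invalid in the filesystem encoding can still be passed back to the operating system unchanged.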
msg112030 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-07-30 00:22
I wrote a patch to fix this issue, see #9425.
msg118981 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-17 20:37
Good news: this issue is now fixed in py3k (Python 3.2). I cannot give a commit number, because there are too many commits related to this problem (see #8611 and #9425), but it works ;-)
msg149087 - (view) Author: Serg Asminog (Serg.Asminog) Date: 2011-12-09 12:05
dirname = 'A-Za-z\xc4\xd6\xdc\xe4\xf6\xfc\xdf'

Traceback (most recent call last):
  File "D:\temp\python bug\test.py", line 19, in <module>
    file_object, file_path, description = imp.find_module(basename, [dirname])
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
msg149088 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-12-09 12:58
@Serg Asminog: What is your Python version? What is your locale encoding (print(sys.getfilesystemencoding()))? What is your Windows version?
msg149097 - (view) Author: Serg Asminog (Serg.Asminog) Date: 2011-12-09 14:11
print(sys.getfilesystemencoding())
print(os.name)
print(sys.version)
print(sys.version_info)
print(sys.platform)

-----
mbcs
nt
3.2.2 (default, Sep  4 2011, 09:07:29) [MSC v.1500 64 bit (AMD64)]
sys.version_info(major=3, minor=2, micro=2, releaselevel='final', serial=0)
win32

-----------
Windows 7 64bit
msg149098 - (view) Author: Serg Asminog (Serg.Asminog) Date: 2011-12-09 14:16
Also 

Traceback (most recent call last):
  File "D:\temp\python bug\test.py", line 20, in <module>
    file_object, file_path, description = imp.find_module(basename, [dirname])
ImportError: No module named mymodule

with Python 2.6.6 (r266:84297, Aug 24 2010, 18:13:38) [MSC v.1500 64 bit (AMD64)]
msg149099 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-12-09 14:18
Oops, it's not sys.getfilesystemencoding() but locale.getpreferredencoding() that is interesting. Can you give me your locale encoding?
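
One way to collect both values in a single run, as a sketch (the function name encoding_report and the dictionary keys are hypothetical):

```python
import locale
import sys


def encoding_report():
    """Gather the encodings and version details asked for in this thread."""
    return {
        "filesystem encoding": sys.getfilesystemencoding(),
        # False: query the locale without changing it.
        "locale encoding": locale.getpreferredencoding(False),
        "python version": ".".join(map(str, sys.version_info[:3])),
        "platform": sys.platform,
    }


for key, value in encoding_report().items():
    print(key, "=", value)
```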
msg149102 - (view) Author: Serg Asminog (Serg.Asminog) Date: 2011-12-09 14:55
cp1251
History
Date User Action Args
2011-12-09 14:55:44 Serg.Asminog set messages: + msg149102
2011-12-09 14:18:36 vstinner set messages: + msg149099
2011-12-09 14:16:17 Serg.Asminog set messages: + msg149098
2011-12-09 14:11:51 Serg.Asminog set messages: + msg149097
2011-12-09 12:58:05 vstinner set messages: + msg149088
2011-12-09 12:05:12 Serg.Asminog set files: + test.py
nosy: + Serg.Asminog
messages: + msg149087

2010-10-17 20:37:08 vstinner set status: open -> closed
resolution: fixed
messages: + msg118981
2010-07-30 00:22:42 vstinner set messages: + msg112030
2010-06-18 23:37:42 vstinner set messages: + msg108150
2010-06-14 22:50:03 vstinner set messages: + msg107833
2010-05-19 20:45:45 vstinner set messages: + msg106096
2009-03-30 14:21:42 gvanrossum set nosy: + gvanrossum
messages: + msg84547
2009-03-30 05:43:30 asvetlov set messages: + msg84512
components: + Interpreter Core
versions: + Python 3.1
2009-03-30 02:28:54 asvetlov set nosy: + asvetlov
messages: + msg84459
2009-03-19 23:10:51 benjamin.peterson set nosy: + benjamin.peterson
messages: + msg83829
2009-03-19 23:06:54 vstinner set messages: + msg83828
2009-03-19 23:00:00 vstinner set messages: + msg83825
2008-11-19 13:52:16 amaury.forgeotdarc set nosy: + amaury.forgeotdarc
messages: + msg76048
2008-11-19 13:13:17 vstinner set nosy: + vstinner
messages: + msg76045
2008-11-19 05:42:31 Jukka Aho set title: imp.find_module() causes UnicodeDecodeError with non-ASCII search paths -> imp.find_module() fails with a UnicodeDecodeError when called with non-ASCII search paths
2008-11-19 05:17:45 Jukka Aho create