classification
Title: runpy cannot run Unicode path on Windows
Type: behavior
Stage:
Components: Library (Lib), Unicode, Windows
Versions: Python 3.4
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: Drekin, amaury.forgeotdarc, ezio.melotti, vstinner
Priority: normal
Keywords:

Created on 2013-03-31 18:08 by Drekin, last changed 2013-08-26 20:39 by vstinner. This issue is now closed.

Messages (8)
msg185634 - Author: Adam Bartoš (Drekin) * Date: 2013-03-31 18:08
runpy.run_path("\u222b.py") raises UnicodeEncodeError on Windows when it tries to encode the path with the mbcs codec. However, opening the same file with open() works. So why does runpy encode the name with mbcs when that is neither necessary nor correct? See http://bpaste.net/show/aOqQLMyYAAFTJ8pQnkli/ .
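For reference, a minimal reproduction sketch of the report. The temporary directory, the script contents, and the helper name `run_unicode_script` are illustrative assumptions, not taken from the linked paste. On a Python that includes the fix from issue #11619 (3.4 or later) this should run; on an affected Windows build the `runpy.run_path` call is where the UnicodeEncodeError would be expected.

```python
import os
import runpy
import tempfile


def run_unicode_script(name="\u222b.py"):
    """Create a script with a non-ANSI filename and execute it via runpy.

    Hypothetical helper for illustration only.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, name)
        # open() accepts the Unicode path directly, as noted in the report.
        with open(path, "w", encoding="utf-8") as f:
            f.write("x = 1\n")
        # run_path() returns the globals dict of the executed script.
        return runpy.run_path(path)


result = run_unicode_script()
print(result["x"])
```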
msg185856 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2013-04-02 20:27
The issue is actually with compile():
  compile('x=1', '\u222b.py', 'exec')
fails on my Western Windows machine (mbcs = cp1252).
This conversion should not be necessary, since the filename is only used for error messages (and decoded again!)
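A small sketch of the point about error messages, assuming an interpreter where compile() accepts the Unicode filename (for example one with the later fix). The filename ends up on the code object and is shown again in the traceback; the source string "1/0" is just an illustrative way to trigger an error.

```python
import traceback

# The filename is never opened here; it is only recorded on the code object.
code = compile("1/0", "\u222b.py", "exec")

try:
    exec(code)
except ZeroDivisionError:
    tb_text = traceback.format_exc()

# ascii() keeps the output printable even on consoles that cannot show U+222B.
print(ascii(code.co_filename))
print("\u222b.py" in tb_text)
```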

But unfortunately the various API functions used by compile() are documented to take a filename encoded with the filesystem encoding:
http://docs.python.org/dev/c-api/veryhigh.html#Py_CompileStringExFlags
This API is unfortunate; on Windows Python should never have to convert filenames unless bytes strings are explicitly used.

I can see two ways to fix the issue:
- build another set of APIs which take unicode strings for the filename, or at least encoded to UTF-8.
- use some trick for unencodable filenames; filename.encode('mbcs', 'backslashreplace') works, but does not round-trip (and cannot fetch source code in tracebacks). I don't know if there is some variant of surrogateescape that we could use.
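To illustrate why the backslashreplace trick does not round-trip, here is a short sketch. It assumes `cp1252` as a stand-in for `mbcs` (which only exists on Windows and maps to cp1252 on a Western machine, as mentioned above), so it can be run on any platform.

```python
filename = "\u222b.py"
codec = "cp1252"  # assumption: stand-in for 'mbcs' on a Western Windows machine

# Strict encoding fails because U+222B has no cp1252 representation.
try:
    filename.encode(codec)
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# backslashreplace succeeds, but writes a literal escape sequence...
encoded = filename.encode(codec, "backslashreplace")
# ...which decodes to the escape text, not to the original character.
decoded = encoded.decode(codec)

print(strict_ok)            # False
print(encoded)              # b'\\u222b.py'
print(decoded == filename)  # False
```

The decoded name is a different string (`\u222b.py` spelled out as nine ASCII characters), which is why a traceback could not locate the source file from it.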
msg185857 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2013-04-02 20:46
I have a similar issue with a directory '∫' ('\u222b') containing a file foo.py:

>>> sys.path.insert(0, '\u222b')
>>> import foo
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<frozen importlib._bootstrap>", line 1564, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1531, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 586, in _check_name_wrapper
  File "<frozen importlib._bootstrap>", line 1023, in load_module
  File "<frozen importlib._bootstrap>", line 1004, in load_module
  File "<frozen importlib._bootstrap>", line 562, in module_for_loader_wrapper
  File "<frozen importlib._bootstrap>", line 854, in _load_module
  File "<frozen importlib._bootstrap>", line 981, in get_code
  File "<frozen importlib._bootstrap>", line 313, in _call_with_frames_removed
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

(I got this traceback with "python -v")
line 981 contains a call to compile().
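The import scenario can be sketched as a self-contained script. The temporary directory, the module name `foo_demo` (used instead of `foo` to avoid clashing with anything installed), and the helper name are assumptions for illustration. On an affected Windows build the import is where the error above would be expected; with the later fix it should succeed.

```python
import importlib
import os
import sys
import tempfile


def import_from_unicode_dir(dirname="\u222b", module_name="foo_demo"):
    """Import a module from a directory whose name is not ANSI-encodable.

    Hypothetical helper for illustration only.
    """
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, dirname)
        os.mkdir(target)
        with open(os.path.join(target, module_name + ".py"), "w",
                  encoding="utf-8") as f:
            f.write("VALUE = 42\n")
        sys.path.insert(0, target)
        try:
            # Make sure the path finders notice the freshly created directory.
            importlib.invalidate_caches()
            return importlib.import_module(module_name)
        finally:
            # Undo the global side effects of the demonstration.
            sys.path.remove(target)
            sys.modules.pop(module_name, None)


module = import_from_unicode_dir()
print(module.VALUE)
```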
msg185858 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-02 20:50
This issue is a duplicate of issue #11619. In short: when importing a Python module, Python 3.3 only supports paths encodable to the ANSI code page. Issue #11619 contains a huge patch to support *any* Unicode character in the module path. I closed that issue because I consider that nobody needs such a feature :-)

What is your use case? Do you really need to support ∫ as a *Python* module name or a Python script filename? Is Windows at least able to display this character?
msg185861 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2013-04-02 21:45
No need to use weird characters. Greek or Cyrillic letters are enough.
Suppose I download a library with language modules such as Русский.py or Ελληνικά.py; these names are valid identifiers and can be imported normally... on a UTF-8 system at least.

Actually such a project already exists: https://code.google.com/p/hellenic-language-toolkit/
"svn co" correctly creates the files (Windows Explorer shows the correct names); when importing from IDLE, I get:

>>> import HLT
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    import HLT
  File ".\HLT.py", line 29, in <module>
    from Ελληνικά.Ελληνικά import Ελληνικά
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
msg185904 - Author: Adam Bartoš (Drekin) * Date: 2013-04-03 09:36
I have no specific use case. I just thought that runpy.run_path should behave the same as running the file directly (which works).

The file ∫.py can be created, displayed and run by Python with no problem on Windows.
msg195955 - Author: Adam Bartoš (Drekin) * Date: 2013-08-23 09:08
There is a closely related issue that is over a year old: http://bugs.python.org/issue13758 .
msg196246 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-08-26 20:38
This issue has been fixed in issue #11619 by:

New changeset df2fdd42b375 by Victor Stinner in branch 'default':
Close #11619: The parser and the import machinery do not encode Unicode
http://hg.python.org/cpython/rev/df2fdd42b375

Thanks for the report!

(I don't plan to backport the fix to Python 3.3; it's a huge patch for a rare use case.)
History
Date                 User                Action  Args
2013-08-26 20:39:17  vstinner            set     status: open -> closed; resolution: fixed
2013-08-26 20:38:34  vstinner            set     messages: + msg196246
2013-08-23 09:08:16  Drekin              set     messages: + msg195955
2013-04-03 09:36:05  Drekin              set     messages: + msg185904
2013-04-02 21:45:33  amaury.forgeotdarc  set     messages: + msg185861
2013-04-02 20:50:36  vstinner            set     versions: + Python 3.4, - Python 3.3
2013-04-02 20:50:24  vstinner            set     nosy: + vstinner; messages: + msg185858
2013-04-02 20:46:30  amaury.forgeotdarc  set     messages: + msg185857
2013-04-02 20:27:32  amaury.forgeotdarc  set     nosy: + amaury.forgeotdarc; messages: + msg185856
2013-03-31 18:08:50  Drekin              create