classification
Title: Py_CompileString fails on non decode-able paths.
Type: behavior
Stage:
Components: Unicode
Versions: Python 3.1, Python 3.2
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: ideasman42, vstinner
Priority: normal
Keywords:

Created on 2010-08-30 07:46 by ideasman42, last changed 2010-10-19 02:04 by vstinner. This issue is now closed.

Messages (5)
msg115202 - Author: Campbell Barton (ideasman42) Date: 2010-08-30 07:46
On Linux I have a path which Python reads as:

/data/test/num\udce9ro_bad/untitled.blend

os.listdir("/data/test/") returns ['num\udce9ro_bad'].

But the same path can't be passed to the C API's Py_CompileString().

Where fn is '/data/test/num\udce9ro_bad/untitled.blend/test.py'
 Py_CompileString(buf, fn, Py_file_input);

...gives this error.
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 14-16: invalid data

According to this PEP, non-decodable paths should use the surrogateescape error handler:
http://www.python.org/dev/peps/pep-0383/
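
The difference between the two error handlers can be sketched in Python. This is only an illustration, assuming the filename reaches the C API as the raw bytes below (the directory name holding a single 0xE9 byte rather than valid UTF-8); the variable names are made up for the example.

```python
# Raw bytes of the filename as the C API would receive them (assumption:
# the directory name contains the lone byte 0xE9, which is not valid UTF-8).
raw = b"/data/test/num\xe9ro_bad/untitled.blend/test.py"

# Strict UTF-8 decoding rejects the lone 0xE9 byte.
try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# PEP 383: surrogateescape maps each undecodable byte 0xNN to U+DCNN,
# so the name decodes and encodes back to the same bytes.
name = raw.decode("utf-8", "surrogateescape")
round_trip = name.encode("utf-8", "surrogateescape")

print(strict_error.start)   # offset of the offending byte
print(ascii(name))          # ascii() avoids printing a lone surrogate
print(round_trip == raw)
```

The offset of the rejected byte matches the start of the range in the error above, and the decoded name is the one os.listdir() returns.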
msg115282 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-08-31 22:29
The problem is not specific to Py_CompileString(): all functions based (indirectly) on PyParser_ASTFromString() and PyParser_ASTFromFile() expect filenames encoded in utf-8 with the strict error handler.

If we choose to use something other than utf-8 in strict mode, here is an incomplete list of functions that have to be patched:
 - parser:
   * initerr()
   * err_input()
 - ast:
   * ast_error_finish()

And the list of impacted functions (parsing functions accepting filenames):
 - PyParser_ParseStringFlagsFilename()
 - PyParser_ParseFile*()
 - PyParser_ASTFromString(), PyParser_ASTFromFile()
 - PyAST_FromNode()
 - PyRun_SimpleFile*()
 - PyRun_AnyFile*()
 - PyRun_InteractiveOneFlags()
 - etc.

All these functions are public, and I don't think it would be a good idea to change the encoding (e.g. to iso-8859-1). We can use a different error handler (especially surrogateescape, as suggested in the initial message) and/or create new functions accepting unicode filenames.

--

I'm working on undecodable filenames in issues #8611 and #9425, especially on the import machinery part. Once the import machinery is fully unicode compliant, the last part will be the "parser machinery" (Parser/*.c). Patching the parser is a little more complex because of the bootstrap problem: the parser is compiled twice, once with a small subset of the C Python API (using some mockups) and once with the full API.
msg115943 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-09-09 12:49
#6543 changed the encoding of the filename argument of PyRun_SimpleFileExFlags() (and of all functions based on PyRun_SimpleFileExFlags()) and of the c_filename attribute of the (private) compiler structure in Python 3.1.3: they now use utf-8 in strict mode instead of the filesystem encoding with surrogateescape.
msg118838 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-10-15 22:26
See also issue #10114.
msg119103 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-10-19 02:04
See issue #10114: fixed in Python 3.1 (r85716) and in Python 3.2 (r85569+r85570).
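
A quick way to check a build from the Python side might look like the sketch below. It uses the builtin compile() for illustration only; that compile() follows the same filename handling as Py_CompileString() is an assumption, and the variable names are hypothetical.

```python
# Filename containing a lone surrogate, as produced by surrogateescape
# for the undecodable byte 0xE9 (path taken from the original report).
fn = "/data/test/num\udce9ro_bad/untitled.blend/test.py"

# compile() stands in for Py_CompileString() here; whether both share the
# same filename handling is an assumption of this sketch.
code = compile("x = 1\n", fn, "exec")

# Run the compiled code to confirm the code object is usable.
namespace = {}
exec(code, namespace)

# ascii() avoids writing a lone surrogate to stdout.
print(ascii(code.co_filename))
```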
History
Date                 User         Action  Args
2010-10-19 02:04:37  vstinner     set     status: open -> closed; resolution: fixed; messages: + msg119103
2010-10-15 22:26:46  vstinner     set     messages: + msg118838
2010-09-09 12:49:26  vstinner     set     messages: + msg115943
2010-08-31 22:30:08  vstinner     set     components: + Unicode, - None; versions: + Python 3.2
2010-08-31 22:29:51  vstinner     set     messages: + msg115282
2010-08-30 12:25:16  eric.araujo  set     nosy: + vstinner
2010-08-30 07:46:51  ideasman42   create