classification
Title: Support undecodable filenames in the parser API
Type:
Stage:
Components: Interpreter Core, Unicode
Versions: Python 3.2

process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: loewis, vstinner
Priority: normal
Keywords:

Created on 2010-10-13 23:52 by vstinner, last changed 2010-10-14 12:05 by vstinner. This issue is now closed.

Messages (3)
msg118604 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-13 23:52
It looks like the parser API (e.g. PyParser_ParseFileFlagsEx, PyParser_ASTFromFile) expects a UTF-8 filename: err_input() decodes the filename from UTF-8. But 

Example in a non-ascii directory (/home/SHARE/SVN/py3kéŁ) and an ascii locale:
----
$ LANG= ./python -c "import inspect"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/SHARE/SVN/py3k\xe9\u0141/Lib/inspect.py", line 1
SyntaxError: encoding problem: with BOM
----

The problem occurs in fp_setreadl(): this function reopens the file with the right encoding. But to open the file, the bytes filename is decoded from utf-8 (in strict mode), whereas the filename (in my example) contains surrogates and utf-8 in strict mode rejects surrogates.
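
The rejection described above can be reproduced from Python. This is only a rough sketch, not the C code path in fp_setreadl(). It assumes the example directory name is stored as UTF-8 on disk and is decoded under an ASCII locale with the surrogateescape error handler; the variable names are made up for the illustration.

```python
# Bytes of the example directory name, ASSUMING it is stored as UTF-8 on disk.
raw = "py3k\u00e9\u0141".encode("utf-8")

# Under an ASCII locale, Python 3 decodes filenames with surrogateescape:
# each non-ASCII byte becomes a lone surrogate in the range U+DC80..U+DCFF.
name = raw.decode("ascii", "surrogateescape")
print(ascii(name))

# Strict UTF-8 refuses lone surrogates, so converting the str back fails.
strict_error = None
try:
    name.encode("utf-8")
except UnicodeEncodeError as exc:
    strict_error = exc
    print("strict utf-8 failed:", exc.reason)

# Using the same error handler on the way back restores the original bytes.
assert name.encode("ascii", "surrogateescape") == raw
```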

To support undecodable filenames in the parser API, we have two solutions:

 * Use the filesystem encoding with surrogateescape (PyUnicode_EncodeFSDefault, PyUnicode_DecodeFSDefault)
 * Use utf-8 in another mode: surrogateescape or surrogatepass
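
As a Python-level illustration of the two options (the C functions named above are not called here), the sketch below uses os.fsdecode()/os.fsencode() as the counterparts of the filesystem-encoding approach, and the codec error handlers directly for the UTF-8 variants. The byte strings are hypothetical sample names.

```python
import os

# Option 1: filesystem encoding with its default error handler.
# os.fsdecode/os.fsencode are the Python-level counterparts.
raw = b"py3k\xc3\xa9\xc5\x81"  # hypothetical name, valid UTF-8 bytes
fs_name = os.fsdecode(raw)
assert os.fsencode(fs_name) == raw  # round-trips whatever the locale is

# Option 2a: UTF-8 with surrogateescape. Undecodable bytes are smuggled
# through as lone surrogates and restored on encoding.
bad = b"py3k\xe9"  # hypothetical name, NOT valid UTF-8
escaped = bad.decode("utf-8", "surrogateescape")
assert escaped.encode("utf-8", "surrogateescape") == bad

# Option 2b: UTF-8 with surrogatepass. The surrogate code point itself is
# encoded as a three-byte sequence, so the str round-trips, but the result
# is no longer the original filename bytes.
passed = escaped.encode("utf-8", "surrogatepass")
assert passed.decode("utf-8", "surrogatepass") == escaped
assert passed != bad
```

The difference matters for compatibility: surrogateescape preserves the original bytes, while surrogatepass only preserves the str form.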

The parser API has many public functions, and we have to consider the compatibility with Python 3.1.

See also #9713 and #8611.
msg118630 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-14 07:16
We shouldn't need to reopen the file in the first place. If we already have a file handle, we can rewind it. Then the encoding of the file name becomes irrelevant.

I keep forgetting: what was the plan for deprecating the FILE* functions in the parser interface? If we need to continue to support them, we could read the whole contents of the file before parsing, and then use the memory-based parsing algorithm.
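
The "read everything, then parse from memory" idea can be sketched at the Python level as follows. This is an analogy only, not the C parser interface: compile_from_handle and its display_name parameter are hypothetical names, and the built-in compile() stands in for the memory-based parser.

```python
import tempfile


def compile_from_handle(fp, display_name="<handle>"):
    """Compile source from an open binary file object without reopening it by name."""
    fp.seek(0)
    data = fp.read()
    # compile() accepts bytes and honours a BOM or a PEP 263 coding cookie
    # itself, so the file never has to be reopened with another encoding.
    return compile(data, display_name, "exec")


with tempfile.TemporaryFile() as fp:
    fp.write("# -*- coding: latin-1 -*-\nname = 'caf\u00e9'\n".encode("latin-1"))
    code = compile_from_handle(fp)

namespace = {}
exec(code, namespace)
print(ascii(namespace["name"]))
```

Because the filename is used only for display here, its encoding no longer affects whether the source can be read.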

If parsing files can be fully based on the IO module, we shouldn't even need to rewind the file. Instead, the io module should support switching the encoding mid-stream (unless, say, we are in the middle of a multibyte character - since the parser always asks for complete lines, this should not happen).
msg118653 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-14 12:05
> We shouldn't need to reopen the file in the first place. 
> If we already have a file handle, we can rewind it.
> Then the encoding of the file name becomes irrelevant.

Oh yes, great idea. r85476 implements this solution (use lseek(0) on fileno(tok->fp)). The code path already existed, but only for the case where filename was NULL, and I don't think it worked there because there was no call to lseek(0).
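
A Python analogue of rewinding the descriptor instead of reopening by name might look like the sketch below. It is not the change made in r85476: open_source_from_fd is a hypothetical helper, and tokenize.detect_encoding() stands in for the tokenizer's own encoding detection.

```python
import io
import os
import tempfile
import tokenize


def open_source_from_fd(fd):
    """Detect the source encoding, rewind fd, and return a text reader on it."""
    # Unbuffered reader, so no read-ahead is left behind in a Python-level buffer.
    probe = io.FileIO(fd, "r", closefd=False)
    encoding, _ = tokenize.detect_encoding(probe.readline)
    # Rewind through the descriptor, like lseek(fileno(fp), 0, SEEK_SET) in C.
    os.lseek(fd, 0, os.SEEK_SET)
    # Wrap the same descriptor again; the filename is never needed.
    return open(fd, "r", encoding=encoding, closefd=False)


fd, path = tempfile.mkstemp(suffix=".py")
try:
    os.write(fd, "# -*- coding: latin-1 -*-\nname = 'caf\u00e9'\n".encode("latin-1"))
    os.lseek(fd, 0, os.SEEK_SET)
    with open_source_from_fd(fd) as reader:
        text = reader.read()
finally:
    os.close(fd)
    os.remove(path)

print(ascii(text))
```

Without the os.lseek() call the second reader would start after the lines already consumed by the encoding detection, which matches the observation that the old code path could not have worked.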

The commit fixes this issue (LANG= ./python -c "import inspect"). I have other issues with "LANG= ./python Lib/test/regrtest.py -v test_pydoc test_traceback" but I think that it is a new (different) issue.
History
Date                 User      Action  Args
2010-10-14 12:05:45  vstinner  set     status: open -> closed; resolution: fixed; messages: + msg118653
2010-10-14 07:16:59  loewis    set     nosy: + loewis; messages: + msg118630
2010-10-13 23:52:56  vstinner  create