Issue 10095: Support undecodable filenames in the parser API


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/54304

classification

Title:	Support undecodable filenames in the parser API
Type:		Stage:
Components:	Interpreter Core, Unicode	Versions:	Python 3.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	loewis, vstinner
Priority:	normal	Keywords:

Created on 2010-10-13 23:52 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (3)
msg118604	Author: STINNER Victor (vstinner)	Date: 2010-10-13 23:52
It looks like the parser API (e.g. PyParser_ParseFileFlagsEx, PyParser_ASTFromFile) expects a utf-8 filename: err_input() decodes the filename from utf-8.

Example in a non-ascii directory (/home/SHARE/SVN/py3kéŁ) and an ascii locale:
----
$ LANG= ./python -c "import inspect"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/SHARE/SVN/py3k\xe9\u0141/Lib/inspect.py", line 1
SyntaxError: encoding problem: with BOM
----

The problem occurs in fp_setreadl(): this function reopens the file with the right encoding. But to open the file, the bytes filename is decoded from utf-8 (in strict mode), whereas the filename (in my example) contains surrogates, and utf-8 in strict mode rejects surrogates.

To support undecodable filenames in the parser API, we have two solutions:
 * Use the filesystem encoding with surrogateescape (PyUnicode_EncodeFSDefault, PyUnicode_DecodeFSDefault)
 * Use utf-8 in another mode: surrogateescape or surrogatepass

The parser API has many public functions, and we have to consider the compatibility with Python 3.1.

See also #9713 and #8611.
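The codec behaviour described above can be reproduced from Python. This is only a minimal sketch: it assumes the directory name is stored on disk as UTF-8 bytes and is decoded with an explicit ascii codec (standing in for the ascii locale), and the helper `strict_utf8_fails` is a hypothetical name introduced for illustration. It is not the parser's C code.

```python
# Assumption: the directory name from the example, stored on disk as UTF-8 bytes.
raw = b"/home/SHARE/SVN/py3k\xc3\xa9\xc5\x81"

# Under an ascii locale the filesystem codec cannot decode the last four bytes.
# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
name = raw.decode("ascii", "surrogateescape")


def strict_utf8_fails(text):
    """Return True if strict utf-8 refuses to encode `text` (hypothetical helper)."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


# Strict utf-8 rejects the lone surrogates.
print(strict_utf8_fails(name))

# surrogateescape restores the original bytes exactly.
print(name.encode("utf-8", "surrogateescape") == raw)

# surrogatepass also accepts the string, but encodes the surrogates themselves,
# so the result differs from the bytes on disk.
passed = name.encode("utf-8", "surrogatepass")
print(passed != raw, passed.decode("utf-8", "surrogatepass") == name)
```

Of the two utf-8 modes, only surrogateescape gives back the byte string that the operating system actually knows the file by; surrogatepass round-trips the Python string but not the on-disk name.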
msg118630	Author: Martin v. Löwis (loewis)	Date: 2010-10-14 07:16
We shouldn't need to reopen the file in the first place. If we already have a file handle, we can rewind it. Then the encoding of the file name becomes irrelevant.

I keep forgetting: what was the plan for deprecating the FILE* functions in the parser interface? If we need to continue to support them, we could read the whole contents of the file before parsing, and then use the memory-based parsing algorithm.

If parsing files can be fully based on the IO module, we shouldn't even need to rewind the file. Instead, the io module should support switching the encoding mid-stream (unless, say, we are in the middle of a multibyte character - since the parser always asks for complete lines, this should not happen).
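The "read the whole contents, then parse from memory" idea can be illustrated at the Python level. This is a sketch under assumptions, not the C parser interface: `compile_from_handle` and the display name `<memory>` are hypothetical, and an in-memory `io.BytesIO` stands in for an already-open file. `compile()` accepts bytes and honours the coding cookie itself, so the file name is never decoded or used to reopen anything.

```python
import io


def compile_from_handle(binary_file, display_name="<memory>"):
    """Read everything from an open binary handle and compile it from memory.

    `display_name` is only used in tracebacks; it is never used to open a file.
    """
    data = binary_file.read()
    return compile(data, display_name, "exec")


# Assumption: a latin-1 encoded source file, declared with a coding cookie.
source_bytes = b"# -*- coding: latin-1 -*-\nname = '\xe9'\n"

namespace = {}
exec(compile_from_handle(io.BytesIO(source_bytes)), namespace)
print(namespace["name"] == "\xe9")
```

The trade-off is memory: the whole file is held as bytes before parsing starts, which is the cost mentioned for keeping the FILE* functions working this way.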
msg118653	Author: STINNER Victor (vstinner)	Date: 2010-10-14 12:05
> We shouldn't need to reopen the file in the first place.
> If we already have a file handle, we can rewind it.
> Then the encoding of the file name becomes irrelevant.

Oh yes, great idea. r85476 implements this solution (use lseek(0) on fileno(tok->fp)). The code path already existed, but it was only taken if filename was NULL. And I don't think that it worked, because there was no call to lseek(0).

The commit fixes this issue (LANG= ./python -c "import inspect"). I have other issues with "LANG= ./python Lib/test/regrtest.py -v test_pydoc test_traceback", but I think that it is a new (different) issue.
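The rewind-instead-of-reopen approach can be sketched in Python as well. This is not the code from r85476, only an illustration of the same idea under assumptions: `reopen_by_fd` is a hypothetical helper, the encoding is detected with `tokenize.detect_encoding`, and a temporary file stands in for the module being imported. Once the file is open, only its descriptor is used, so the file name never has to be decoded.

```python
import os
import tempfile
import tokenize


def reopen_by_fd(binary_file):
    """Detect the source encoding, rewind the descriptor, and wrap it as text.

    Hypothetical helper: the path is never consulted, only the descriptor.
    """
    encoding, _ = tokenize.detect_encoding(binary_file.readline)
    fd = binary_file.fileno()
    os.lseek(fd, 0, os.SEEK_SET)  # without this rewind, reading resumes mid-file
    # closefd=False: the caller's handle stays responsible for closing the fd.
    return open(fd, "r", encoding=encoding, closefd=False)


# Assumption: a latin-1 encoded source file, declared with a coding cookie.
source_bytes = b"# -*- coding: latin-1 -*-\nname = '\xe9'\n"

fd, path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "wb") as f:
    f.write(source_bytes)

with open(path, "rb") as f:
    text_stream = reopen_by_fd(f)
    decoded = text_stream.read()
    text_stream.close()  # does not close the underlying descriptor
os.remove(path)

print(decoded == source_bytes.decode("latin-1"))
```

The explicit `os.lseek` call is the important step: detecting the encoding has already consumed part of the file, so a stream created on the same descriptor without rewinding would miss the beginning of the source.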

History
Date	User	Action	Args
2022-04-11 14:57:07	admin	set	github: 54304
2010-10-14 12:05:45	vstinner	set	status: open -> closed resolution: fixed messages: + msg118653
2010-10-14 07:16:59	loewis	set	nosy: + loewis messages: + msg118630
2010-10-13 23:52:56	vstinner	create