Issue 3975: PyTraceBack_Print() doesn't respect # coding: xxx header

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/48225

classification

Title:	PyTraceBack_Print() doesn't respect # coding: xxx header
Type:		Stage:
Components:		Versions:	Python 3.0

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	amaury.forgeotdarc, vstinner
Priority:	normal	Keywords:	patch

Created on 2008-09-26 15:12 by vstinner, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
unicode.py	vstinner, 2008-10-08 10:51
traceback_unicode-5.patch	vstinner, 2008-10-08 11:16	New version of _Py_DisplaySourceLine() supporting coding header (patch version 5)

Messages (16)
msg73851 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-09-26 15:12
PyTraceBack_Print() doesn't take care of the "# coding: xxx" header of a Python script. It calls _Py_DisplaySourceLine() which opens the file as a byte stream (and not an unicode characters stream). Because of this problem, the traceback maybe truncated or invalid. Example (write it into a file and execute the file): ---- from sys import executable from os import execvpe filename = "pouet.py" out = open(filename, "wb") out.write(b"""# -- coding: GBK -- print("--asc\xA1\xA7i--") raise Exception("--asc\xA1\xA7i--")""") out.close() execvpe(executable, [executable, filename], None) ---- This issue depends on issue2384 (line number). Note: Python 2.6 may also has the problem but it doesn't parse "# coding: GBK". So it's a different problem (issue?).
msg73859 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-09-26 16:26
Here is a new version of _Py_DisplaySourceLine() using PyTokenizer_FindEncoding() to read the coding header, and PyFile_FromFd() to create an "unicode-awake" file. The code could be optimized, but it least it displays correctly the file line ;-) The code is based on the original _Py_DisplaySourceLine() and call_find_module() (import.c). Notes: * The code is young and new, it might be delayed until python 3.1 * Some functions may raise new exceptions (eg. MemoryError). I don't know how Python will react if an exception is raised during PyTraceBack_Print() ? * The return code is 0 for success, but is it -1 or 1 for an error? It looks like error is "result != 0", so both -1 and 1 should be valid. I used -1.
msg73862 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-09-26 16:28
(oops, first patch included an useless whitespace change)
msg73928 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-09-27 15:27
Ooops, my first version introduces a regression: if file open fails, the traceback printing was stopped. Here is a new version of my patch to support #coding: header in _Py_DisplaySourceLine(). It doesn't print the line of file open fails, but continue to display the end of the traceback. But print still stops on PyFile_WriteObject() or PyFile_WriteString(). If PyFile fails, I guess that next print will also fails. (it's also the current behaviour of PyTraceBack_Print). Example: ---- Python 3.0rc1+ (py3k:66643M, Sep 27 2008, 17:11:51) >>> raise Exception('err') Traceback (most recent call last): File "<stdin>", line 1, in <module> Exception: err ---- The line is not displayed (why? no idea), but the exception ("Exception: err") is still displayed.
msg74441 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2008-10-07 12:21
> The line is not displayed (why? no idea) Well, it's difficult to re-open <stdin>...
msg74443 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2008-10-07 13:05
The patch goes in the good direction, here are some remarks: - The two calls to open() are missing the O_BINARY flag, necessary on Windows to avoid newline translation. This may be important for some codecs. - the GIL should be released around calls to open(); searching the whole sys.path can be long. Besides this, the "char* filename" is relevant only on utf8-encoded filesystems, and will not work on windows with non-ascii filenames. But this is another issue.
msg74450 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-10-07 13:50
Le Tuesday 07 October 2008 15:06:01 Amaury Forgeot d'Arc, vous avez écrit : > - The two calls to open() are missing the O_BINARY flag, necessary on > Windows to avoid newline translation. This may be important for some > codecs. Oops, I always forget Windows specific options :-/ > - the GIL should be released around calls to open(); searching the whole > sys.path can be long. Ok, O_BINARY and GIL fixed. I just realized that sys.path is considered as a bytes string (PyBytes_Check) whereas Python3 uses an unicode string! ['', '/home/haypo/prog/python-ptrace', (...)] >>> sys.path.append(b'bytes'); sys.path ['', '/home/haypo/prog/python-ptrace', '(...)', b'bytes'] Oops, so it's possible to mix strings and bytes. But the normal case is a list of unicode strings. Another fix is needed :-/ I don't know how to get into "if (fd < 0)" (code using sys.path). Any clue? > Besides this, the "char* filename" is relevant only on utf8-encoded > filesystems, and will not work on windows with non-ascii filenames. But > this is another issue. I don't know the Windows API enough to answer. Doesn't Python provide "high level" functions to open a file?
msg74455 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2008-10-07 14:54
> Ok, O_BINARY and GIL fixed. which patch? ;-) > I just realized that sys.path is considered as a bytes string > (PyBytes_Check) whereas Python3 uses an unicode string! > > ['', '/home/haypo/prog/python-ptrace', (...)] > >>> sys.path.append(b'bytes'); sys.path > ['', '/home/haypo/prog/python-ptrace', '(...)', b'bytes'] > > Oops, so it's possible to mix strings and bytes. But the normal case is a list > of unicode strings. Another fix is needed :-/ You could do like the import function, which converts unicode with the fs encoding. See the code in import.c, below "PyList_GetItem(path, i)" > I don't know how to get into "if (fd < 0)" (code using sys.path). > Any clue? The .pyc stores the full path of the .py, at compilation time. If the directory is moved, or accessed with another name (via mounts or soft links), the traceback displays the filename as stored in the .pyc file, but the content can still be displayed. It does not work every time, though, for example when the directory is shared between Windows & Unix. > > Besides this, the "char* filename" is relevant only on utf8-encoded > > filesystems, and will not work on windows with non-ascii filenames. > > But this is another issue. > > I don't know the Windows API enough to answer. Windows interprets the char* variables with its system-set "mbcs" encoding. On my machine, it is equivalent to the cp1252 encoding (almost equivalent latin-1). Since the 'filename' comes from a call to _PyUnicode_AsString(tb->tb_frame->f_code->co_filename), it will be wrong with any accented letter. > Doesn't Python provide "high level" functions to open a file? io.open()?
msg74460 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2008-10-07 15:45
I still have remarks about traceback_unicode-3.patch, that I did not see before: - (a minor thing) PyMem_FREE(found_encoding) could appear only once, just after PyFile_FromFd. - I feel it dangerous to call the PyUnicode_AS_UNICODE() macro without checking that PyFile_GetLine() actually returned a PyUnicode object. - If the "truncated" variable is NULL, an exception has been set; it is necessary to exit the function.
msg74493 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-10-07 22:45
Thanks for your remarks amaury. I improved my patch: - PyMem_FREE(found_encoding) is called just after PyFile_FromFd() - Create static subfunction _Py_FindSourceFile(): find a file in sys.path - Consider that sys.path contains only unicode (and not bytes) string - Clear exception on "truncated = PyUnicode_FromUnicode(p, len);" error, but continue the execution - I added PyUnicode_check(lineobj) - Replace open(filename) by open(namebuf) (while searching the full path of the file) <= fix a regression introduced by my patch Should I stop on the first error instead of using PyErr_Clear()? I would like to display the traceback even if an function failed. Sum up of the patch version 4: - use open(O_RDONLY \| O_BINARY) + PyFile_FromFd() instead of fopen() - open the file in unicode mode using the Python encoding found in the "#coding:..." header - consider sys.path as a list of unicode strings (and not of bytes strings) I used _PyUnicode_AsStringAndSize() to convert unicode to string to be consistent with tb_printinternal(). As you noticed, it uses UTF-8 which should is on Windows :-/ I propose to open a new issue when this one will be closed :-)
msg74499 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2008-10-08 00:07
One last thing: there is a path where lineobj is not freed (when PyUnicode_Check(lineobj) is false); I suggest to move Py_XDECREF(lineobj) just before the final return statement. Reference counting is fun ;-) > Should I stop on the first error instead of using PyErr_Clear()? > I would like to display the traceback even if an function failed. You are right. Common failures should clear the error and return 0. In your patch, there is one remaining place, the call to PyFile_GetLine(). More fun will arise when my Windows terminal (encoding=cp1252) will try to display Chinese characters. Let's pretend this is yet another issue.
msg74524 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-10-08 10:51
> More fun will arise when my Windows terminal (encoding=cp1252) > will try to display Chinese characters. Let's pretend this is > yet another issue. I tried the patch using a script with unicode characters (character not representable in ISO-8859-1 like polish characters ł and Ł). Result in an UTF-8 terminal (my default locale): Traceback (most recent call last): File "unicode.py", line 2, in <module> raise ValueError("unicode: Łł") ValueError: unicode: Łł => correct Result in an ISO-8859-1 terminal (I changed the encoding in my Konsole configuration): Traceback (most recent call last): File "unicode.py", line 2, in <module> raise ValueError("unicode: \u0141\u0142") ValueError: unicode: \u0141\u0142 => correct Why does it work? It's because PyErr_Display() uses sys.stderr instead of sys.stdout and sys.stderr uses a different unicode error mechanism: >>> import sys >>> sys.stdout.errors 'strict' >>> sys.stderr.errors 'backslashreplace' Which is a great idea :-) You can try on Windows using the new attached file unicode.py.
msg74525 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-10-08 11:15
@amaury: Oops, yes, I introduced a refleak in the version 4 with the PyUnicode_Check(). Instead of just moved Py_(X)RECREF(lineobj);, I could not not resist to refactor the code to remove one more indentation level (I prefer if (...) return; instead of if (...) { very long block; }). Changes in version 5: - rename 'namebuf' buffer to 'buf', it's used for the filename and to display the indentation space (strcpy(buf, ' ');). - move Py_DECREF(fob); at the end of the GetLine loop - return on lineobj error I think that the new version is easier to read than the current code because they are few indentation and no more local variables (if (...) { local var; ... })
msg74536 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2008-10-08 17:17
The code is indeed easier to follow. I don't have any more remark, thanks to you perseverance! Now, is there some unit test we could provide? #2384 depends on this issue, it should be easy to extract a small test case.
msg74539 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-10-08 17:37
My patch for #2384 contains a testcase which require #3975 and #2384 to be fixed (you have to apply both patches to test it).
msg74610 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2008-10-09 23:39
Committed r66867, together with #2384. Thanks for your perseverance ;-)

History
Date	User	Action	Args
2022-04-11 14:56:39	admin	set	github: 48225
2008-10-09 23:39:24	amaury.forgeotdarc	set	status: open -> closed resolution: fixed messages: + msg74610
2008-10-08 17:37:39	vstinner	set	messages: + msg74539
2008-10-08 17:17:49	amaury.forgeotdarc	set	messages: + msg74536
2008-10-08 11:16:04	vstinner	set	files: + traceback_unicode-5.patch
2008-10-08 11:15:54	vstinner	set	files: - traceback_unicode-5.patch
2008-10-08 11:15:34	vstinner	set	files: - traceback_unicode-4.patch
2008-10-08 11:15:28	vstinner	set	files: + traceback_unicode-5.patch messages: + msg74525
2008-10-08 10:51:41	vstinner	set	files: + unicode.py messages: + msg74524
2008-10-08 00:12:54	vstinner	set	files: - traceback_unicode-3.patch
2008-10-08 00:12:52	vstinner	set	files: - traceback_unicode-2.patch
2008-10-08 00:12:49	vstinner	set	files: - traceback_unicode.patch
2008-10-08 00:07:35	amaury.forgeotdarc	set	messages: + msg74499
2008-10-07 23:44:00	amaury.forgeotdarc	link	issue2384 dependencies
2008-10-07 22:45:52	vstinner	set	files: + traceback_unicode-4.patch messages: + msg74493
2008-10-07 15:45:29	amaury.forgeotdarc	set	messages: + msg74460
2008-10-07 15:00:32	vstinner	set	files: + traceback_unicode-3.patch
2008-10-07 14:55:00	amaury.forgeotdarc	set	messages: + msg74455
2008-10-07 13:50:10	vstinner	set	messages: + msg74450
2008-10-07 13:05:58	amaury.forgeotdarc	set	messages: + msg74443
2008-10-07 12:21:36	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg74441
2008-09-27 15:27:30	vstinner	set	files: + traceback_unicode-2.patch messages: + msg73928
2008-09-26 16:28:20	vstinner	set	files: + traceback_unicode.patch messages: + msg73862
2008-09-26 16:27:52	vstinner	set	files: - traceback_unicode.patch
2008-09-26 16:26:51	vstinner	set	files: + traceback_unicode.patch keywords: + patch messages: + msg73859
2008-09-26 15:12:28	vstinner	create