classification
Title: PyTraceBack_Print() doesn't respect # coding: xxx header
Type: Stage:
Components: Versions: Python 3.0
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, haypo
Priority: normal Keywords: patch

Created on 2008-09-26 15:12 by haypo, last changed 2008-10-09 23:39 by amaury.forgeotdarc. This issue is now closed.

Files
File name Uploaded Description Edit
unicode.py haypo, 2008-10-08 10:51
traceback_unicode-5.patch haypo, 2008-10-08 11:16 New version of _Py_DisplaySourceLine() supporting coding header (patch version 5)
Messages (16)
msg73851 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-09-26 15:12
PyTraceBack_Print() doesn't take care of the "# coding: xxx" header of 
a Python script. It calls _Py_DisplaySourceLine() which opens the file 
as a byte stream (and not an unicode characters stream). Because of 
this problem, the traceback maybe truncated or invalid. Example (write 
it into a file and execute the file):
----
from sys import executable
from os import execvpe
filename = "pouet.py"
out = open(filename, "wb")
out.write(b"""# -*- coding: GBK -*-
print("--asc\xA1\xA7i--")
raise Exception("--asc\xA1\xA7i--")""")
out.close()
execvpe(executable, [executable, filename], None)
----

This issue depends on issue2384 (line number).

Note: Python 2.6 may also has the problem but it doesn't parse "# 
coding: GBK". So it's a different problem (issue?).
msg73859 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-09-26 16:26
Here is a new version of _Py_DisplaySourceLine() using 
PyTokenizer_FindEncoding() to read the coding header, and 
PyFile_FromFd() to create an "unicode-awake" file. The code could be 
optimized, but it least it displays correctly the file line ;-)

The code is based on the original _Py_DisplaySourceLine() and 
call_find_module() (import.c).

Notes:
* The code is young and new, it might be delayed until python 3.1
* Some functions may raise new exceptions (eg. MemoryError). I don't 
know how Python will react if an exception is raised during 
PyTraceBack_Print() ?
* The return code is 0 for success, but is it -1 or 1 for an error? It 
looks like error is "result != 0", so both -1 and 1 should be valid. I 
used -1.
msg73862 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-09-26 16:28
(oops, first patch included an useless whitespace change)
msg73928 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-09-27 15:27
Ooops, my first version introduces a regression: if file open fails, 
the traceback printing was stopped. Here is a new version of my patch 
to support #coding: header in _Py_DisplaySourceLine(). It doesn't 
print the line of file open fails, but continue to display the end of 
the traceback.

But print still stops on PyFile_WriteObject() or PyFile_WriteString(). 
If PyFile fails, I guess that next print will also fails. (it's also 
the current behaviour of PyTraceBack_Print).

Example:
----
Python 3.0rc1+ (py3k:66643M, Sep 27 2008, 17:11:51)
>>> raise Exception('err')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
Exception: err
----

The line is not displayed (why? no idea), but the exception 
("Exception: err") is still displayed.
msg74441 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-10-07 12:21
> The line is not displayed (why? no idea)
Well, it's difficult to re-open <stdin>...
msg74443 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-10-07 13:05
The patch goes in the good direction, here are some remarks:

- The two calls to open() are missing the O_BINARY flag, necessary on
Windows to avoid newline translation. This may be important for some codecs.

- the GIL should be released around calls to open(); searching the whole
sys.path can be long.

Besides this, the "char* filename" is relevant only on utf8-encoded
filesystems, and will not work on windows with non-ascii filenames. But
this is another issue.
msg74450 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-07 13:50
Le Tuesday 07 October 2008 15:06:01 Amaury Forgeot d'Arc, vous avez écrit :
> - The two calls to open() are missing the O_BINARY flag, necessary on
> Windows to avoid newline translation. This may be important for some
> codecs.

Oops, I always forget Windows specific options :-/

> - the GIL should be released around calls to open(); searching the whole
> sys.path can be long.

Ok, O_BINARY and GIL fixed.

I just realized that sys.path is considered as a bytes string (PyBytes_Check) 
whereas Python3 uses an unicode string!

['', '/home/haypo/prog/python-ptrace', (...)]
>>> sys.path.append(b'bytes'); sys.path
['', '/home/haypo/prog/python-ptrace', '(...)', b'bytes']

Oops, so it's possible to mix strings and bytes. But the normal case is a list 
of unicode strings. Another fix is needed :-/

I don't know how to get into "if (fd < 0)" (code using sys.path). Any clue?

> Besides this, the "char* filename" is relevant only on utf8-encoded
> filesystems, and will not work on windows with non-ascii filenames. But
> this is another issue.

I don't know the Windows API enough to answer.

Doesn't Python provide "high level" functions to open a file?
msg74455 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-10-07 14:54
> Ok, O_BINARY and GIL fixed.
which patch? ;-)

> I just realized that sys.path is considered as a bytes string
> (PyBytes_Check) whereas Python3 uses an unicode string!
> 
> ['', '/home/haypo/prog/python-ptrace', (...)]
> >>> sys.path.append(b'bytes'); sys.path
> ['', '/home/haypo/prog/python-ptrace', '(...)', b'bytes']
> 
> Oops, so it's possible to mix strings and bytes. But the normal case
is a list 
> of unicode strings. Another fix is needed :-/

You could do like the import function, which converts unicode with the
fs encoding. See the code in import.c, below "PyList_GetItem(path, i)"

> I don't know how to get into "if (fd < 0)" (code using sys.path).
> Any clue?

The .pyc stores the full path of the .py, at compilation time. If the
directory is moved, or accessed with another name (via mounts or soft
links), the traceback displays the filename as stored in the .pyc file,
but the content can still be displayed. It does not work every time,
though, for example when the directory is shared between Windows & Unix.

> > Besides this, the "char* filename" is relevant only on utf8-encoded
> > filesystems, and will not work on windows with non-ascii filenames.
> > But this is another issue.
> 
> I don't know the Windows API enough to answer.

Windows interprets the char* variables with its system-set "mbcs"
encoding. On my machine, it is equivalent to the cp1252 encoding (almost
equivalent latin-1). Since the 'filename' comes from a call to
_PyUnicode_AsString(tb->tb_frame->f_code->co_filename), it will be wrong
with any accented letter.

> Doesn't Python provide "high level" functions to open a file?
io.open()?
msg74460 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-10-07 15:45
I still have remarks about traceback_unicode-3.patch, that I did not see
before:
- (a minor thing) PyMem_FREE(found_encoding) could appear only once,
just after PyFile_FromFd.
- I feel it dangerous to call the PyUnicode_AS_UNICODE() macro without
checking that PyFile_GetLine() actually returned a PyUnicode object.
- If the "truncated" variable is NULL, an exception has been set; it is
necessary to exit the function.
msg74493 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-07 22:45
Thanks for your remarks amaury. I improved my patch:
 - PyMem_FREE(found_encoding) is called just after PyFile_FromFd()
 - Create static subfunction _Py_FindSourceFile(): find a file in 
sys.path
 - Consider that sys.path contains only unicode (and not bytes) string
 - Clear exception on "truncated = PyUnicode_FromUnicode(p, len);" 
error, but continue the execution
 - I added PyUnicode_check(lineobj)
 - Replace open(filename) by open(namebuf) (while searching the full 
path of the file) <= fix a regression introduced by my patch

Should I stop on the first error instead of using PyErr_Clear()? I 
would like to display the traceback even if an function failed.

Sum up of the patch version 4:
 - use open(O_RDONLY | O_BINARY) + PyFile_FromFd() instead of fopen()
 - open the file in unicode mode using the Python encoding found in 
the "#coding:..." header
 - consider sys.path as a list of unicode strings (and not of bytes 
strings)

I used _PyUnicode_AsStringAndSize() to convert unicode to string to be 
consistent with tb_printinternal(). As you noticed, it uses UTF-8 
which should is on Windows :-/ I propose to open a new issue when this 
one will be closed :-)
msg74499 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-10-08 00:07
One last thing: there is a path where lineobj is not freed (when PyUnicode_Check(lineobj) is false); I suggest to move 
Py_XDECREF(lineobj) just before the final return statement.
Reference counting is fun ;-)

> Should I stop on the first error instead of using PyErr_Clear()?
> I would like to display the traceback even if an function failed.
You are right. Common failures should clear the error and return 0.
In your patch, there is one remaining place, the call to PyFile_GetLine().

More fun will arise when my Windows terminal (encoding=cp1252) will try 
to display Chinese characters. Let's pretend this is yet another issue.
msg74524 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-08 10:51
> More fun will arise when my Windows terminal (encoding=cp1252) 
> will try to display Chinese characters. Let's pretend this is 
> yet another issue.

I tried the patch using a script with unicode characters (character 
not representable in ISO-8859-1 like polish characters ł and Ł).

Result in an UTF-8 terminal (my default locale):
   Traceback (most recent call last):
     File "unicode.py", line 2, in <module>
       raise ValueError("unicode: Łł")
    ValueError: unicode: Łł
=> correct

Result in an ISO-8859-1 terminal (I changed the encoding in my Konsole 
configuration):
   Traceback (most recent call last):
     File "unicode.py", line 2, in <module>
       raise ValueError("unicode: \u0141\u0142")
   ValueError: unicode: \u0141\u0142
=> correct

Why does it work? It's because PyErr_Display() uses sys.stderr instead 
of sys.stdout and sys.stderr uses a different unicode error mechanism:
   >>> import sys
   >>> sys.stdout.errors
   'strict'
   >>> sys.stderr.errors
   'backslashreplace'

Which is a great idea :-)

You can try on Windows using the new attached file unicode.py.
msg74525 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-08 11:15
@amaury: Oops, yes, I introduced a refleak in the version 4 with the 
PyUnicode_Check(). Instead of just moved Py_(X)RECREF(lineobj);, I 
could not not resist to refactor the code to remove one more 
indentation level (I prefer if (...) return; instead of if (...) { 
very long block; }).

Changes in version 5:
 - rename 'namebuf' buffer to 'buf', it's used for the filename and to 
display the indentation space (strcpy(buf, '          ');).
 - move Py_DECREF(fob); at the end of the GetLine loop
 - return on lineobj error

I think that the new version is easier to read than the current code 
because they are few indentation and no more local variables (if (...) 
{ local var; ... })
msg74536 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-10-08 17:17
The code is indeed easier to follow.
I don't have any more remark, thanks to you perseverance!

Now, is there some unit test we could provide? #2384 depends on this
issue, it should be easy to extract a small test case.
msg74539 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-08 17:37
My patch for #2384 contains a testcase which require #3975 and #2384 
to be fixed (you have to apply both patches to test it).
msg74610 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-10-09 23:39
Committed r66867, together with #2384. Thanks for your perseverance ;-)
History
Date User Action Args
2008-10-09 23:39:24amaury.forgeotdarcsetstatus: open -> closed
resolution: fixed
messages: + msg74610
2008-10-08 17:37:39hayposetmessages: + msg74539
2008-10-08 17:17:49amaury.forgeotdarcsetmessages: + msg74536
2008-10-08 11:16:04hayposetfiles: + traceback_unicode-5.patch
2008-10-08 11:15:54hayposetfiles: - traceback_unicode-5.patch
2008-10-08 11:15:34hayposetfiles: - traceback_unicode-4.patch
2008-10-08 11:15:28hayposetfiles: + traceback_unicode-5.patch
messages: + msg74525
2008-10-08 10:51:41hayposetfiles: + unicode.py
messages: + msg74524
2008-10-08 00:12:54hayposetfiles: - traceback_unicode-3.patch
2008-10-08 00:12:52hayposetfiles: - traceback_unicode-2.patch
2008-10-08 00:12:49hayposetfiles: - traceback_unicode.patch
2008-10-08 00:07:35amaury.forgeotdarcsetmessages: + msg74499
2008-10-07 23:44:00amaury.forgeotdarclinkissue2384 dependencies
2008-10-07 22:45:52hayposetfiles: + traceback_unicode-4.patch
messages: + msg74493
2008-10-07 15:45:29amaury.forgeotdarcsetmessages: + msg74460
2008-10-07 15:00:32hayposetfiles: + traceback_unicode-3.patch
2008-10-07 14:55:00amaury.forgeotdarcsetmessages: + msg74455
2008-10-07 13:50:10hayposetmessages: + msg74450
2008-10-07 13:05:58amaury.forgeotdarcsetmessages: + msg74443
2008-10-07 12:21:36amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg74441
2008-09-27 15:27:30hayposetfiles: + traceback_unicode-2.patch
messages: + msg73928
2008-09-26 16:28:20hayposetfiles: + traceback_unicode.patch
messages: + msg73862
2008-09-26 16:27:52hayposetfiles: - traceback_unicode.patch
2008-09-26 16:26:51hayposetfiles: + traceback_unicode.patch
keywords: + patch
messages: + msg73859
2008-09-26 15:12:28haypocreate