Title: profile doesn't support non-UTF8 source code
Messages (9)
msg75627 - (view) Author: Takafumi SHIDO (shidot) Date: 2008-11-08 02:49
The profile module of Python3 doesn't understand the character set of
the script.

When profile is run (e.g. with $python -m profile -o prof.dat)
on a script that declares its character set on the second
line (e.g. #coding:utf-8),
profile crashes with an error message like:
"SyntaxError: unknown encoding: utf-8"
msg75676 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-11-10 00:40
exec() doesn't work if the argument is a unicode string. Here is a
workaround for the profile module (open the file in binary mode), but it
doesn't fix the exec() problem.
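
A minimal sketch of the "open the file in binary mode" idea, not the attached patch itself: the script is read as bytes so that compile() can honour the coding declaration. The helper name run_script_as_bytes and the demo file name are hypothetical.

```python
import os
import tempfile


def run_script_as_bytes(path, namespace=None):
    """Read a script in binary mode, compile it, and execute it.

    Hypothetical helper: passing bytes lets the compiler read the
    "# coding:" declaration instead of assuming UTF-8.
    """
    if namespace is None:
        namespace = {"__name__": "__main__"}
    with open(path, "rb") as f:
        source = f.read()
    code = compile(source, path, "exec")
    exec(code, namespace)
    return namespace


# Demo: an ISO-8859-1 script that is not valid UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_latin1.py")
    with open(demo_path, "wb") as f:
        f.write(b"# coding: ISO-8859-1\ntext = 'h\xe9 h\xe9'\n")
    result = run_script_as_bytes(demo_path)
```

After the demo runs, result["text"] holds the decoded string with both accented characters intact.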
msg75677 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-11-10 01:03
Example of the problem:
exec('#header\n# encoding: ISO-8859-1\nprint("h\xe9 h\xe9")\n')

exec(unicode) calls source_as_string() which converts unicode to bytes
using _PyUnicode_AsDefaultEncodedString() (UTF-8 charset). Then
PyRun_StringFlags() is called with the UTF-8 byte string with
PyCF_SOURCE_IS_UTF8 flag. But in the parser, get_coding_spec() recognizes
the "#coding:" header and converts bytes to unicode using the specified
charset (which may be different than UTF-8).
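
The double conversion described above can be simulated by hand. This is only a sketch: it encodes the text to UTF-8 itself and passes bytes to exec(), which is an assumption standing in for what source_as_string() did internally; exec() on a str in current interpreters may behave differently. The variable names are illustrative.

```python
# Source text with an ISO-8859-1 coding declaration on the second line.
source_text = '#header\n# encoding: ISO-8859-1\nvalue = "h\xe9 h\xe9"\n'

# Bytes that match the declaration: decoded correctly.
ns_matching = {}
exec(source_text.encode("ISO-8859-1"), ns_matching)

# Bytes encoded as UTF-8 but still declaring ISO-8859-1: the tokenizer
# re-decodes each UTF-8 byte as ISO-8859-1, producing mojibake.
ns_mismatch = {}
exec(source_text.encode("utf-8"), ns_mismatch)

correct_value = ns_matching["value"]
garbled_value = ns_mismatch["value"]
```

Here correct_value keeps the two accented characters, while garbled_value contains two characters for each of them, because the UTF-8 bytes were interpreted with the declared charset.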

The problem is in the function PyAST_FromNode(): the flag is not used in
the tokenizer but only in the AST parser. I also see:
    if (flags && flags->cf_flags & PyCF_SOURCE_IS_UTF8) {
        c.c_encoding = "utf-8";
        if (TYPE(n) == encoding_decl) {
#if 0
            ast_error(n, "encoding declaration in Unicode string");
            goto error;
#endif
            n = CHILD(n, 0);
        }
    } else if (TYPE(n) == encoding_decl) {
        c.c_encoding = STR(n);
        n = CHILD(n, 0);
    } else {
        /* PEP 3120 */
        c.c_encoding = "utf-8";
    }

The ast_error() may be uncommented.
msg83842 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-20 01:25
This bug was a duplicate of #4626 which was fixed by r70113 ;-)
msg83843 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-20 01:30
Oops, I misread this issue (wrong title!). #4626 is related, but this 
issue is about the profile module. The problem is that profile opens 
the source code as text (with the default charset: UTF-8).

Attached patch fixes the problem.

--- (ISO-8859-1 text file) ---
#coding: ISO-8859-1
print("hé hé")

Run: python -m profile

Current result:
  File ".../py3k/Lib/", line 614, in main
    script =
  File ".../Lib/", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes (...)

With my patch, it works as expected.
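
The failure above can be reproduced without profile. This sketch writes the same two-line ISO-8859-1 script to a temporary file (the file name is hypothetical) and opens it as UTF-8 text; encoding="utf-8" is passed explicitly as an assumption, since the default text encoding depends on the platform.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "latin1_script.py")
    # The script from the report, stored as ISO-8859-1 bytes.
    with open(path, "wb") as f:
        f.write(b'#coding: ISO-8859-1\nprint("h\xe9 h\xe9")\n')
    try:
        # Reading as UTF-8 text ignores the coding declaration.
        with open(path, encoding="utf-8") as f:
            f.read()
        decode_failed = False
    except UnicodeDecodeError:
        decode_failed = True
```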
msg83844 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-20 01:44
Oops, benjamin noticed that it doesn't work with Windows line endings 
(\r\n). The new patch reads the file encoding and opens the file as text 
with it, instead of reading the file content as bytes.
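
A sketch of the "read the file encoding" approach using tokenize.detect_encoding() from the standard library. It is not the attached patch; the helper name read_source_with_declared_encoding and the demo file name are hypothetical.

```python
import os
import tempfile
import tokenize


def read_source_with_declared_encoding(path):
    """Return (encoding, text) for a script, honouring its coding line.

    Hypothetical helper: detect the encoding from the first lines, then
    reopen the file as text so newlines are translated to "\n".
    """
    with open(path, "rb") as f:
        encoding, _ = tokenize.detect_encoding(f.readline)
    with open(path, encoding=encoding) as f:
        return encoding, f.read()


# Demo: ISO-8859-1 script with Windows line endings.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_crlf.py")
    with open(demo_path, "wb") as f:
        f.write(b"# coding: ISO-8859-1\r\ntext = 'h\xe9 h\xe9'\r\n")
    detected, source = read_source_with_declared_encoding(demo_path)
    namespace = {}
    exec(compile(source, demo_path, "exec"), namespace)
```

Opening the file in text mode after detection means the source passed to compile() is already a decoded string with "\n" line endings.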
msg83846 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-20 01:56
This regression was introduced by the removal of execfile() in 
Python3. The proposed replacement of execfile() is wrong. I propose a 
generic fix in the issue #5524.
msg83933 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-21 10:51
After some discussions, I think that my first patch 
(profile_encoding.patch) was correct but we also have to fix #4628.
msg101477 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-03-22 02:00
Fixed by r79271 (py3k), r79272 (3.1).
