
Title: profile doesn't support non-UTF8 source code
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.0
Status: closed Resolution: fixed
Dependencies: 4626 4628 Superseder:
Assigned To: Nosy List: brett.cannon, christian.heimes, shidot, vstinner
Priority: normal Keywords: needs review, patch

Created on 2008-11-08 02:49 by shidot, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name                 Uploaded                    Description
profile_encoding.patch    vstinner, 2008-11-10 00:40  profile module: open input file (from the command line) in binary mode
profile_encoding-2.patch  vstinner, 2009-03-20 01:44
Messages (9)
msg75627 - Author: Takafumi SHIDO (shidot) Date: 2008-11-08 02:49
The profile module of Python 3 doesn't understand the character set of
the script.

When the profiler is run (e.g. $ python -m profile -o prof.dat)
on a script that declares its character set in the second
line (e.g. #coding:utf-8),
it crashes with an error message like:
"SyntaxError: unknown encoding: utf-8"
msg75676 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-11-10 00:40
exec() doesn't work if the argument is a Unicode string. Here is a
workaround for the profile module (open the file in binary mode), but it
doesn't fix the exec() problem.
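
The binary-mode workaround described above can be sketched roughly as follows. This is an illustration, not the attached patch: the helper name run_script_bytes and the temporary script are assumptions made for the example. The idea is that compile() receives bytes, so the parser decodes the source itself using the script's coding declaration.

```python
import os
import tempfile


def run_script_bytes(path):
    """Hypothetical helper: read the script as bytes so that compile()
    decodes it according to its own coding declaration (UTF-8 by default)."""
    with open(path, "rb") as fp:
        source = fp.read()
    code = compile(source, path, "exec")
    namespace = {"__name__": "__main__", "__file__": path}
    exec(code, namespace)
    return namespace


# Write a small ISO-8859-1 encoded script to a temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as tmp:
    tmp.write('#coding: ISO-8859-1\ntext = "h\xe9 h\xe9"\n'.encode("iso-8859-1"))
script_path = tmp.name

try:
    result = run_script_bytes(script_path)
finally:
    os.unlink(script_path)
```

After the call, result["text"] holds the string decoded with the declared charset rather than with UTF-8.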
msg75677 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-11-10 01:03
Example of the problem:

    exec('#header\n# encoding: ISO-8859-1\nprint("h\xe9 h\xe9")\n')

exec(unicode) calls source_as_string() which converts unicode to bytes
using _PyUnicode_AsDefaultEncodedString() (UTF-8 charset). Then
PyRun_StringFlags() is called with the UTF-8 byte string with
PyCF_SOURCE_IS_UTF8 flag. But in the parser, get_coding_spec() recognizes
the "#coding:" header and converts bytes to unicode using the specified
charset (which may be different from UTF-8).

The problem is in the function PyAST_FromNode(): the flag is not used in
the tokenizer but only in the AST parser. I also see:
    if (flags && flags->cf_flags & PyCF_SOURCE_IS_UTF8) {
        c.c_encoding = "utf-8";
        if (TYPE(n) == encoding_decl) {
#if 0
            ast_error(n, "encoding declaration in Unicode string");
            goto error;
#endif
            n = CHILD(n, 0);
        }
    } else if (TYPE(n) == encoding_decl) {
        c.c_encoding = STR(n);
        n = CHILD(n, 0);
    } else {
        /* PEP 3120 */
        c.c_encoding = "utf-8";
    }

The ast_error() may be uncommented.
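
For comparison, here is a small sketch of how a str source and a bytes source with the same coding declaration behave on a recent Python 3 interpreter (one that includes the fix for #4626). The variable names are illustrative only. With a str argument the text is already decoded, so the declaration is ignored; with a bytes argument the declaration selects the charset used for decoding.

```python
# The same program as text (already decoded) and as ISO-8859-1 bytes.
source_text = '#header\n# encoding: ISO-8859-1\nvalue = "h\xe9 h\xe9"\n'
source_bytes = source_text.encode("iso-8859-1")

# str source: the coding declaration is ignored, the characters are kept as-is.
ns_text = {}
exec(compile(source_text, "<text>", "exec"), ns_text)

# bytes source: the parser decodes the bytes using the declared charset.
ns_bytes = {}
exec(compile(source_bytes, "<bytes>", "exec"), ns_bytes)
```

Both namespaces end up with the same value, which is the behaviour the original report expected.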
msg83842 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-20 01:25
This bug was a duplicate of #4626 which was fixed by r70113 ;-)
msg83843 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-20 01:30
Oops, I misread this issue (wrong title!). #4626 is related, but this 
issue is about the profile module. The problem is that profile opens 
the source code as text (with the default charset: UTF-8).

Attached patch fixes the problem.

--- (ISO-8859-1 text file) ---
#coding: ISO-8859-1
print("hé hé")

Run: python -m profile

Current result:
  File ".../py3k/Lib/", line 614, in main
    script =
  File ".../Lib/", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes (...)

With my patch, it works as expected.
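
The failure in the traceback above comes down to decoding ISO-8859-1 bytes as UTF-8. A minimal sketch of that step in isolation (it does not use the profile module itself, and the variable names are illustrative):

```python
# Bytes of the ISO-8859-1 test script shown above.
data = '#coding: ISO-8859-1\nprint("h\xe9 h\xe9")\n'.encode("iso-8859-1")

# Decoding with the default charset (UTF-8) ignores the coding declaration
# and fails on the 0xE9 byte.
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
else:
    error = None

# Decoding with the declared charset works.
decoded = data.decode("iso-8859-1")
```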
msg83844 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-20 01:44
Oops, Benjamin noticed that it doesn't work with Windows line endings 
(\r\n). The new patch reads the file's encoding instead of reading the file 
content as bytes.
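
One way to read the declared encoding first and then decode the file as text is sketched below with tokenize.detect_encoding() from the standard library. This is an assumption about the general approach, not the content of profile_encoding-2.patch; the helper name read_source is hypothetical.

```python
import io
import tokenize


def read_source(raw):
    """Hypothetical helper: decode script bytes using the declared encoding
    and normalise Windows line endings."""
    # detect_encoding() looks at the BOM and the coding declaration.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(raw).readline)
    # newline=None enables universal newlines, so "\r\n" is read as "\n".
    wrapper = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline=None)
    return wrapper.read()


# ISO-8859-1 script with Windows line endings.
raw = '#coding: ISO-8859-1\r\ntext = "h\xe9 h\xe9"\r\n'.encode("iso-8859-1")
source = read_source(raw)

namespace = {}
exec(compile(source, "<script>", "exec"), namespace)
```

Because the source is decoded to str before compile() sees it, the line endings are already normalised and the charset has been applied exactly once.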
msg83846 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-20 01:56
This regression was introduced by the removal of execfile() in 
Python 3. The proposed replacement for execfile() is wrong. I propose a 
generic fix in issue #5524.
msg83933 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-21 10:51
After some discussions, I think that my first patch 
(profile_encoding.patch) was correct but we also have to fix #4628.
msg101477 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-03-22 02:00
Fixed by r79271 (py3k), r79272 (3.1).
History
Date                 User              Action  Args
2022-04-11 14:56:41  admin             set     github: 48532
2010-03-22 02:00:33  vstinner          set     status: open -> closed
                                               resolution: fixed
                                               messages: + msg101477
2009-03-21 10:51:36  vstinner          set     dependencies: + No universal newline support for compile() when using bytes
                                               messages: + msg83933
2009-03-20 01:56:57  vstinner          set     messages: + msg83846
2009-03-20 01:44:43  vstinner          set     files: + profile_encoding-2.patch
                                               keywords: + patch
                                               messages: + msg83844
2009-03-20 01:38:47  brett.cannon      set     keywords: - patch
                                               stage: test needed -> patch review
2009-03-20 01:30:45  vstinner          set     keywords: + needs review
2009-03-20 01:30:35  vstinner          set     status: closed -> open
                                               title: exec(unicode): invalid charset when #coding:xxx spec is used -> profile doesn't support non-UTF8 source code
                                               messages: + msg83843
                                               dependencies: + compile() doesn't ignore the source encoding when a string is passed in
                                               resolution: fixed -> (no value)
2009-03-20 01:25:22  vstinner          set     status: open -> closed
                                               resolution: fixed
                                               messages: + msg83842
2008-11-10 09:48:00  vstinner          set     title: (Python3) The profile module deesn't understand the character set definition -> exec(unicode): invalid charset when #coding:xxx spec is used
2008-11-10 09:46:37  vstinner          set     nosy: + brett.cannon
2008-11-10 01:03:28  vstinner          set     messages: + msg75677
2008-11-10 00:40:37  vstinner          set     files: + profile_encoding.patch
                                               keywords: + patch
                                               messages: + msg75676
                                               nosy: + vstinner
2008-11-09 17:39:21  christian.heimes  set     priority: normal
                                               nosy: + christian.heimes
                                               type: crash -> behavior
                                               components: + Library (Lib)
                                               stage: test needed
2008-11-08 02:49:31  shidot            create