Issue 4282: profile doesn't support non-UTF8 source code

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/48532

classification

Title:	profile doesn't support non-UTF8 source code
Type:	behavior	Stage:	patch review
Components:	Library (Lib)	Versions:	Python 3.0

process

Status:	closed	Resolution:	fixed
Dependencies:	4626 4628	Superseder:
Assigned To:		Nosy List:	brett.cannon, christian.heimes, shidot, vstinner
Priority:	normal	Keywords:	needs review, patch

Created on 2008-11-08 02:49 by shidot, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
profile_encoding.patch	vstinner, 2008-11-10 00:40	profile module: open input file (from the command line) in binary mode
profile_encoding-2.patch	vstinner, 2009-03-20 01:44

Messages (9)
msg75627 - (view)	Author: Takafumi SHIDO (shidot)	Date: 2008-11-08 02:49
The profile module in Python 3 doesn't understand the character set of the script. When the profiler is run (e.g. $ python -m profile -o prof.dat foo.py) on a script (say foo.py) that declares its character set on the second line (e.g. #coding:utf-8), the profiler crashes with an error message like: "SyntaxError: unknown encoding: utf-8"
msg75676 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-11-10 00:40
exec() doesn't work if the argument is a unicode string. Here is a workaround for the profile module (open the file in binary mode), but it doesn't fix the exec() problem.
msg75677 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-11-10 01:03
Example of the problem:

    exec('#header\n# encoding: ISO-8859-1\nprint("h\xe9 h\xe9")\n')

exec(unicode) calls source_as_string(), which converts unicode to bytes using _PyUnicode_AsDefaultEncodedString() (UTF-8 charset). Then PyRun_StringFlags() is called with the UTF-8 byte string and the PyCF_SOURCE_IS_UTF8 flag. But in the parser, get_coding_spec() recognizes the "#coding:" header and converts the bytes to unicode using the specified charset (which may be different from UTF-8).

The problem is in the function PyAST_FromNode(): the flag is not used in the tokenizer but only in the AST parser. I also see:

    if (flags && flags->cf_flags & PyCF_SOURCE_IS_UTF8) {
        c.c_encoding = "utf-8";
        if (TYPE(n) == encoding_decl) {
    #if 0
            ast_error(n, "encoding declaration in Unicode string");
            goto error;
    #endif
            n = CHILD(n, 0);
        }
    } else if (TYPE(n) == encoding_decl) {
        c.c_encoding = STR(n);
        n = CHILD(n, 0);
    } else {
        /* PEP 3120 */
        c.c_encoding = "utf-8";
    }

The ast_error() call could be uncommented.
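As a rough illustration of the str versus bytes distinction discussed in this message, the sketch below assumes a current Python 3 in which #4626 is fixed, so that a coding declaration inside an already-decoded str is ignored while the same declaration in a bytes source is honoured. The variable names are made up for the example, and the test string assigns to a variable instead of printing so the result can be compared.

```python
# Source text whose coding declaration differs from UTF-8
# (same shape as the exec() example in the message above).
source_text = '#header\n# encoding: ISO-8859-1\nresult = "h\xe9 h\xe9"\n'

# Case 1: str source. The text is already decoded, so the declaration
# should not be used to decode it a second time.
ns_text = {}
exec(source_text, ns_text)

# Case 2: bytes source. The parser decodes the bytes itself and uses
# the declared charset to do so.
source_bytes = source_text.encode("iso-8859-1")
ns_bytes = {}
exec(source_bytes, ns_bytes)

# Both paths are expected to produce the same string.
same_result = ns_text["result"] == ns_bytes["result"] == "h\xe9 h\xe9"
print(same_result)
```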
msg83842 - (view)	Author: STINNER Victor (vstinner) *	Date: 2009-03-20 01:25
This bug was a duplicate of #4626 which was fixed by r70113 ;-)
msg83843 - (view)	Author: STINNER Victor (vstinner) *	Date: 2009-03-20 01:30
Oops, I misread this issue (wrong title!). #4626 is related, but this issue is about the profile module. The problem is that profile opens the source code as text (with the default charset: UTF-8). The attached patch fixes the problem.

Example:

    --- x.py (ISO-8859-1 text file) ---
    #coding: ISO-8859-1
    print("hé hé")
    -----------------------------------

Run: python -m profile x.py

Current result:

    (...)
      File ".../py3k/Lib/profile.py", line 614, in main
        script = fp.read()
      File ".../Lib/codecs.py", line 300, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf8' codec can't decode bytes (...)

With my patch, it works as expected.
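The failure and the binary-mode workaround can be reproduced outside of profile with a short script. This is only a sketch: it writes a temporary x.py like the one above, reads it once as UTF-8 text (the failing path) and once as bytes handed straight to compile() (the idea behind opening the file in binary mode). It is not the code from profile_encoding.patch, and the variable names are hypothetical.

```python
import contextlib
import io
import os
import tempfile

source = '#coding: ISO-8859-1\nprint("h\xe9 h\xe9")\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "x.py")
    # Store the script as ISO-8859-1 bytes, as in the report.
    with open(path, "wb") as fp:
        fp.write(source.encode("iso-8859-1"))

    # Failing path: decoding the file as UTF-8 text.
    text_mode_error = None
    try:
        with open(path, encoding="utf-8") as fp:
            fp.read()
    except UnicodeDecodeError as exc:
        text_mode_error = exc

    # Workaround: read raw bytes and let compile() apply the
    # coding declaration itself.
    with open(path, "rb") as fp:
        code = compile(fp.read(), path, "exec")

# Run the compiled script and capture what it prints.
captured = io.StringIO()
with contextlib.redirect_stdout(captured):
    exec(code, {"__name__": "__main__"})
output = captured.getvalue()
```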
msg83844 - (view)	Author: STINNER Victor (vstinner) *	Date: 2009-03-20 01:44
Oops, Benjamin noticed that it doesn't work with Windows line endings (\r\n). The new patch reads the file encoding instead of reading the file content as bytes.
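One way to "read the file encoding" and still get newline translation is sketched below. It assumes tokenize.detect_encoding() from the standard library is an acceptable way to find the declared charset; whether profile_encoding-2.patch does it this way is not stated here. The sample file content and variable names are invented for the example.

```python
import os
import tempfile
import tokenize

# A script with Windows line endings, stored as ISO-8859-1.
data = '#coding: ISO-8859-1\r\nvalue = "h\xe9 h\xe9"\r\n'.encode("iso-8859-1")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "x.py")
    with open(path, "wb") as fp:
        fp.write(data)

    # Step 1: look at the first lines in binary mode to find the
    # declared encoding.
    with open(path, "rb") as fp:
        encoding, _ = tokenize.detect_encoding(fp.readline)

    # Step 2: reopen as text with that encoding; text mode translates
    # \r\n to \n by default.
    with open(path, encoding=encoding) as fp:
        script = fp.read()

# The decoded text can now be executed as a str.
namespace = {}
exec(compile(script, "x.py", "exec"), namespace)
```

Reading as text trades the binary-mode approach for two opens of the same file, but it avoids handing \r\n bytes to compile().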
msg83846 - (view)	Author: STINNER Victor (vstinner) *	Date: 2009-03-20 01:56
This regression was introduced by the removal of execfile() in Python3. The proposed replacement of execfile() is wrong. I propose a generic fix in the issue #5524.
msg83933 - (view)	Author: STINNER Victor (vstinner) *	Date: 2009-03-21 10:51
After some discussions, I think that my first patch (profile_encoding.patch) was correct but we also have to fix #4628.
msg101477 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-03-22 02:00
Fixed by r79271 (py3k), r79272 (3.1).

History
Date	User	Action	Args
2022-04-11 14:56:41	admin	set	github: 48532
2010-03-22 02:00:33	vstinner	set	status: open -> closed resolution: fixed messages: + msg101477
2009-03-21 10:51:36	vstinner	set	dependencies: + No universal newline support for compile() when using bytes messages: + msg83933
2009-03-20 01:56:57	vstinner	set	messages: + msg83846
2009-03-20 01:44:43	vstinner	set	files: + profile_encoding-2.patch keywords: + patch messages: + msg83844
2009-03-20 01:38:47	brett.cannon	set	keywords: - patch stage: test needed -> patch review
2009-03-20 01:30:45	vstinner	set	keywords: + needs review
2009-03-20 01:30:35	vstinner	set	status: closed -> open title: exec(unicode): invalid charset when #coding:xxx spec is used -> profile doesn't support non-UTF8 source code messages: + msg83843 dependencies: + compile() doesn't ignore the source encoding when a string is passed in resolution: fixed -> (no value)
2009-03-20 01:25:22	vstinner	set	status: open -> closed resolution: fixed messages: + msg83842
2008-11-10 09:48:00	vstinner	set	title: (Python3) The profile module deesn't understand the character set definition -> exec(unicode): invalid charset when #coding:xxx spec is used
2008-11-10 09:46:37	vstinner	set	nosy: + brett.cannon
2008-11-10 01:03:28	vstinner	set	messages: + msg75677
2008-11-10 00:40:37	vstinner	set	files: + profile_encoding.patch keywords: + patch messages: + msg75676 nosy: + vstinner
2008-11-09 17:39:21	christian.heimes	set	priority: normal nosy: + christian.heimes type: crash -> behavior components: + Library (Lib) stage: test needed
2008-11-08 02:49:31	shidot	create