classification
Title: "Encoding" detected in non-comment lines
Type: behavior Stage: resolved
Components: 2to3 (2.x to 3.x conversion tool), Demos and Tools, Library (Lib) Versions: Python 3.4, Python 3.5, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: Paul.Bonser, armicron, benjamin.peterson, georg.brandl, kbk, loewis, meador.inge, python-dev, r.david.murray, roger.serwy, serhiy.storchaka, terry.reedy
Priority: normal Keywords: easy, patch

Created on 2013-08-28 23:50 by Paul.Bonser, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
tokenizer.patch armicron, 2013-09-06 16:57 review
detect_encoding_in_comments_only.patch serhiy.storchaka, 2013-09-07 15:15 review
pep0263_regex.diff serhiy.storchaka, 2013-09-07 23:34
Messages (15)
msg196435 - (view) Author: Paul Bonser (Paul.Bonser) Date: 2013-08-28 23:50
lib2to3.pgen2.tokenize:detect_encoding searches the first two lines of the file for the regex "coding[:=]\s*([-\w.]+)" without first checking whether those lines are comments.

You can get 2to3 to fail with "SyntaxError: unknown encoding: 0" using a single-line file:

    coding=0

A simple fix would be to check that the line is a comment before trying to look up the encoding from that line.
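
For illustration, a minimal sketch of how the unanchored pattern pulls an "encoding" out of an ordinary assignment (re.search is used here purely for the demonstration; the variable names are not from the lib2to3 source):

    import re

    # The cookie pattern quoted above, with nothing requiring the line to be a comment.
    cookie_re = re.compile(r"coding[:=]\s*([-\w.]+)")

    line = "coding=0"              # a perfectly legal assignment statement
    match = cookie_re.search(line)
    print(match.group(1))          # -> '0', later rejected as an unknown encoding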
msg197081 - (view) Author: Sergey Vishnikin (armicron) Date: 2013-09-06 16:57
-cookie_re = re.compile("coding[:=]\s*([-\w.]+)")
+cookie_re = re.compile("#[^\r\n]*coding[:=]\s*([-\w.]+)")

Regex matches only if the encoding expression is preceded by a comment.
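
A quick sketch of the effect of the proposed pattern (illustration only, not part of the patch itself):

    import re

    patched_re = re.compile(r"#[^\r\n]*coding[:=]\s*([-\w.]+)")

    print(patched_re.search("coding=0"))                          # None: the bare assignment no longer matches
    print(patched_re.search("# -*- coding: utf-8 -*-").group(1))  # 'utf-8'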
msg197117 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-06 22:22
It will fail on:

"#coding=0"

I'm wondering why findall() is used to match this regexp.
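
One way to read this counterexample (my interpretation; the quotes are taken to be part of the source line): the line is a string literal rather than a comment, yet the proposed pattern still matches because it only requires a '#' somewhere before 'coding' and findall() scans the whole line:

    import re

    patched_re = re.compile(r"#[^\r\n]*coding[:=]\s*([-\w.]+)")

    line = '"#coding=0"'              # a string literal, not a comment
    print(patched_re.findall(line))   # ['0'] -- still misread as an encoding declaration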
msg197165 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-07 15:15
The tokenize module, 2to3, IDLE, and the Tools/scripts/findnocoding.py script are affected by this bug. The proposed patch fixes all of these places and adds tests for tokenize and 2to3.
msg197166 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-07 15:19
And here is a patch which fixes the regular expression in PEP 263.
msg197196 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-09-07 23:09
Nasty bug. Running a file containing 'coding=0', a quite legitimate assignment statement, causes Idle to close with a LookupError leading to a SyntaxError, reported on the console if there is one (an apparent 'crash' otherwise). (Idle closing is a separate problem, with its own issue, from the misinterpretation of 'coding'.)

Loading such a file works, but produces a warning that should not be there.

Adding # leads to "SyntaxError: unknown encoding" in a message box, without closing Idle. I presume this is to be expected and is proper. There is also a warning on loading.

The code patch adds '^[ \t\f]' to the re. \f = FormFeed? Should that really be there? The PEP patch instead adds '^[ \t\v]'. \v = VerticalTab? Same question, and why the difference?

Your other changes to IOBinding.coding_spec look correct and fix a couple of bugs in the function (searching all lines for the coding cookie, mangling a line without a line end).

Someone else should review the other code changes.
msg197201 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-07 23:34
> The code patch adds '^[ \t\f]' to the re. \f = FormFeed? Should that really be there? The PEP patch instead adds '^[ \t\v]', \v= VerticalTab? Same question, and why the difference?

Good catch. I made a mistake in the PEP patch; it should be '\f' ('\014') in all cases.

Yes, it should be. It corresponds to the code in Parser/tokenizer.c.
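
Putting the pieces together, a sketch of the anchored, comment-only form the patches converge on (the exact expressions are in the attached patches; this is only an illustration):

    import re

    cookie_re = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")

    print(cookie_re.match("coding=0"))                              # None
    print(cookie_re.match('"#coding=0"'))                           # None: not a comment line
    print(cookie_re.match("\f# -*- coding: latin-1 -*-").group(1))  # 'latin-1' (leading \f tolerated)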
msg197944 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-09-16 20:41
One of the problems with encoding recognition is that the same logic is more or less reproduced in multiple places, so any fix needs to be applied in multiple places. From detect_encoding_in_comments_only.patch:
Lib/idlelib/IOBinding.py
Lib/lib2to3/pgen2/tokenize.py
Lib/tokenize.py
Tools/scripts/findnocoding.py
Any fix for issues #18960 and #18961 may also need to be applied in multiple places.

If there is not one now, it would be nice if there were just one Python-coded function in Lib/tokenize.py that could be imported and used by the other Python code. (I was going to suggest exposing the function in tokenize.c, but I believe the point of tokenize.py is to not be dependent on CPython.)

I believe the Idle support for \r became obsolete when support for MacOS9 was dropped in 2.4. I notice that it is not part of io universal newline support.
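
For reference, callers on Python 3 can already reuse the public helper in Lib/tokenize.py instead of duplicating the regex; a minimal usage sketch:

    import io
    import tokenize

    source = b"# -*- coding: utf-8 -*-\nx = 1\n"
    encoding, consumed = tokenize.detect_encoding(io.BytesIO(source).readline)
    print(encoding)   # 'utf-8'
    print(consumed)   # the byte lines read while sniffing the cookie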
msg197946 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-09-16 21:05
New changeset 2dfe8262093c by Serhiy Storchaka in branch '3.3':
Issue #18873: The tokenize module, IDLE, 2to3, and the findnocoding.py script
http://hg.python.org/cpython/rev/2dfe8262093c

New changeset 6b747ad4a99a by Serhiy Storchaka in branch 'default':
Issue #18873: The tokenize module, IDLE, 2to3, and the findnocoding.py script
http://hg.python.org/cpython/rev/6b747ad4a99a

New changeset 3d46ef0c62c5 by Serhiy Storchaka in branch '2.7':
Issue #18873: IDLE, 2to3, and the findnocoding.py script now detect Python
http://hg.python.org/cpython/rev/3d46ef0c62c5
msg197948 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-16 21:19
> If there is not now, it would be nice if there were just one python-coded function in Lib/tokenize.py that could be imported and used by the other python code.

Agreed. But look at how many tokenize issues are currently open.

Thank you for your report Paul.

I have left PEP 263 unfixed for now. Perhaps it needs rewording (especially in light of other issues, such as #18960 and #18961).
msg197956 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-09-17 01:18
This appears to be causing buildbot lib2to3 test failures, e.g.:

http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%202.7/builds/2319/steps/test/logs/stdio

http://buildbot.python.org/all/builders/PPC64%20PowerLinux%202.7/builds/206/steps/test/logs/stdio
msg197964 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-09-17 07:09
New changeset f16855d6d4e1 by Serhiy Storchaka in branch '2.7':
Remove the use of non-existing re.ASCII.
http://hg.python.org/cpython/rev/f16855d6d4e1
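
For context (my note, not from the changeset): re.ASCII only exists on Python 3, and on 2.7 str patterns already match \w against ASCII by default, so the backport simply drops the flag. A sketch of a version guard, if shared code ever needed one:

    import re
    import sys

    # Hypothetical guard: use re.ASCII on Python 3, nothing on Python 2.
    ASCII_FLAG = re.ASCII if sys.version_info[0] >= 3 else 0
    cookie_re = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)", ASCII_FLAG)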
msg197965 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-17 07:11
Thanks, David.
msg228299 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-10-03 02:58
This looks like it could be closed. We normally do not patch PEPs after they are implemented. Does a corrected version of something in PEP 263 need to be added to the reference manual?
msg229147 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-10-12 14:24
I haven't fixed all the bugs in encoding cookie handling yet (there are separate issues). Still, this issue can be closed; I'll open a new issue about the PEP when it becomes necessary. The PEP should be corrected because it affects how other Python implementations and other tools handle this.
History
Date User Action Args
2022-04-11 14:57:50  admin  set  github: 63073
2014-10-12 14:24:36  serhiy.storchaka  set  status: open -> closed
    resolution: fixed
    messages: + msg229147
    stage: patch review -> resolved
2014-10-03 02:58:37  terry.reedy  set  messages: + msg228299
    components: - IDLE
    versions: + Python 3.5, - Python 3.3
2013-09-17 07:11:20  serhiy.storchaka  set  messages: + msg197965
2013-09-17 07:09:31  python-dev  set  messages: + msg197964
2013-09-17 01:18:42  r.david.murray  set  nosy: + r.david.murray
    messages: + msg197956
2013-09-16 21:19:40  serhiy.storchaka  set  messages: + msg197948
2013-09-16 21:05:24  python-dev  set  nosy: + python-dev
    messages: + msg197946
2013-09-16 20:41:15  terry.reedy  set  messages: + msg197944
2013-09-07 23:34:59  serhiy.storchaka  set  files: - pep0263_regex.diff
2013-09-07 23:34:33  serhiy.storchaka  set  files: + pep0263_regex.diff
    messages: + msg197201
2013-09-07 23:09:45  terry.reedy  set  messages: + msg197196
2013-09-07 15:19:51  serhiy.storchaka  set  assignee: serhiy.storchaka
2013-09-07 15:19:09  serhiy.storchaka  set  files: + pep0263_regex.diff
    messages: + msg197166
2013-09-07 15:15:06  serhiy.storchaka  set  files: + detect_encoding_in_comments_only.patch
    nosy: + loewis, georg.brandl, terry.reedy, kbk, roger.serwy, meador.inge
    messages: + msg197165
    components: + Demos and Tools, IDLE, Library (Lib)
    stage: needs patch -> patch review
2013-09-06 22:22:35  serhiy.storchaka  set  nosy: + serhiy.storchaka
    messages: + msg197117
2013-09-06 16:57:02  armicron  set  files: + tokenizer.patch
    nosy: + armicron
    messages: + msg197081
    keywords: + patch
2013-08-29 07:36:22  serhiy.storchaka  set  versions: + Python 3.4, - Python 3.1, Python 3.2
    nosy: + benjamin.peterson
    keywords: + easy
    type: crash -> behavior
    stage: needs patch
2013-08-28 23:50:28  Paul.Bonser  create