classification
Title: Mispositioned SyntaxError caret for unknown code points
Type: behavior Stage: needs patch
Components: Versions: Python 3.6, Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Drekin, Rosuav, berker.peksag, ncoghlan, sjt
Priority: normal Keywords: patch

Created on 2016-07-21 11:54 by ncoghlan, last changed 2016-07-24 08:30 by ncoghlan.

Files
File name Uploaded Description Edit
method1-change-cur.patch Rosuav, 2016-07-21 14:19 review
method2-change-cur-and-inp.patch Rosuav, 2016-07-21 14:19 review
method3-change-all-errors.patch Rosuav, 2016-07-21 14:19 review
method4-change-all-errors-if-possible.patch Rosuav, 2016-07-21 14:19 review
where-did-that-error-come-from.patch Rosuav, 2016-07-21 16:33 review
combined.patch Rosuav, 2016-07-21 16:35 review
Messages (11)
msg270914 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-07-21 11:54
As reported by Rustom Mody on python-ideas, the SyntaxError caret is confusingly mispositioned when an invalid Unicode code point appears as part of a larger sequence of invalid code points and/or valid identifier characters:

>>> varname = “d“a”t”apoint
  File "<stdin>", line 1
    varname = “d“a”t”apoint
                          ^
SyntaxError: invalid character in identifier
>>> varname = “d“a”t”apoint.evidence
  File "<stdin>", line 1
    varname = “d“a”t”apoint.evidence
                          ^
SyntaxError: invalid character in identifier
>>> varname = “d“a”t”apoint[evidence]
  File "<stdin>", line 1
    varname = “d“a”t”apoint[evidence]
                          ^
SyntaxError: invalid character in identifier
>>> varname = “d“a”t”apoint(evidence)
  File "<stdin>", line 1
    varname = “d“a”t”apoint(evidence)
                          ^
SyntaxError: invalid character in identifier

If the invalid character is a non-identifier ASCII character, the error message loses the trailing "in identifier" phrase and points to the correct place:

>>> varname = $d$a$t$apoint
  File "<stdin>", line 1
    varname = $d$a$t$apoint
              ^
SyntaxError: invalid syntax
>>> varname = d$a$t$apoint
  File "<stdin>", line 1
    varname = d$a$t$apoint
               ^
SyntaxError: invalid syntax
>>> varname = ^d$a$t$apoint
  File "<stdin>", line 1
    varname = ^d$a$t$apoint
              ^
SyntaxError: invalid syntax
>>> varname = !d$a$t$apoint
  File "<stdin>", line 1
    varname = !d$a$t$apoint
              ^
SyntaxError: invalid syntax
>>> varname = `d$a$t$apoint
  File "<stdin>", line 1
    varname = `d$a$t$apoint
              ^
SyntaxError: invalid syntax
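The same misbehaviour can be observed programmatically rather than at the REPL, by compiling the offending line and inspecting the SyntaxError attributes (a sketch for illustration only, not part of any patch):

```python
# Compile the reported line and look at where the error claims to be.
source = 'varname = “d“a”t”apoint'

try:
    compile(source, "<test>", "exec")
except SyntaxError as err:
    # err.offset is a 1-based column index into err.text; on the
    # affected versions it points near the end of the identifier
    # rather than at the first invalid quote character.
    print(err.msg)
    print(err.offset)
```

The exact offset printed depends on the interpreter version, which is precisely the behaviour under discussion.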
msg270921 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2016-07-21 13:26
This looks like a duplicate of issue 2382.
msg270922 - (view) Author: Chris Angelico (Rosuav) * Date: 2016-07-21 13:28
The question was raised that there might be a problem with (UTF-8) bytes vs characters, but that's definitely not it - pythonrun.c:1362 UTF-8-decodes the line of source and then gets its character length to use as the new offset. So I don't think this is a duplicate of 2382.

(Side point: There appears to be quite a bit of complexity inside the CPython parser to cope with the fact that it does everything in UTF-8 bytes rather than simply decoding to text and lexing that. I presume that's for the sake of efficiency - that it'd be too slow to work through PyUnicode everywhere?)

Am looking into the rest.
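A quick sanity check of the bytes-vs-characters question (sketch only): the test line is longer in UTF-8 bytes than in characters, and the reported offset stays within the character length of the line, so the offset is character-based as described above.

```python
source = 'varname = “d“a”t”apoint'

print(len(source))                  # 23 characters
print(len(source.encode("utf-8")))  # 31 bytes - the curly quotes are 3 bytes each

try:
    compile(source, "<test>", "exec")
except SyntaxError as err:
    # A byte-based offset could exceed the 23-character line length;
    # the reported one does not.
    print(err.offset)
```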
msg270927 - (view) Author: Chris Angelico (Rosuav) * Date: 2016-07-21 14:19
Actually pinpointing the invalid character may be impractical, as there are two all-or-nothing failure modes: either a UnicodeDecodeError (because the source wasn't a valid UTF-8 stream), or PyUnicode_IsIdentifier returns false. Either way, the error applies to the whole identifier, not to one character within it. So there are a few possibilities, corresponding to the patches I'm attaching.

1) Change the way this one specific error is handled, in tokenizer.c verify_identifier(). If it finds an error, adjust tok->cur to point to the beginning of it. No new failures in test suite.

2) As above, but also change tok->inp, because of this comment in tokenizer.h:31 /* NB If done != E_OK, cur must be == inp!!! */ whose meaning I don't understand. This results in truncated error messages, but the comment suggests that method 1 may be breaking an invariant and causing breakage elsewhere. If there is such breakage, though, it isn't exercised by 'make test', which may itself be a problem. No new test failures.

3) Change the handling of ALL parser errors, in parsetok.c parsetok(), so now they all point to tok->start. Octal literals with 8s or 9s in them now get the caret pointing to the invalid digit, rather than the end of the literal. Unterminated strings point to the opening quote. And some forms of IndentationError now segfault Python. Test suite fails (unsurprisingly).

4) In response to the above segfault, hack it back to the old way of doing things if there's no tok->start. Maybe the condition should be done differently? No new failures in the test suite.

I'd ideally like to use the technique from method 3 (either as patch 4 or with some other guard condition). Failing that, can anyone explain the "NB" above, and what ought to be done to comply with it?
msg270929 - (view) Author: Chris Angelico (Rosuav) * Date: 2016-07-21 14:22
BTW, here's how a session looks using method 4's change:

>>> varname = asdf“d“a”t”apoint
  File "<stdin>", line 1
    varname = asdf“d“a”t”apoint
              ^
SyntaxError: invalid character in identifier
>>> varname = asdf“d“a”t”apoint.evidence
  File "<stdin>", line 1
    varname = asdf“d“a”t”apoint.evidence
              ^
SyntaxError: invalid character in identifier
>>> varname = $d$a$t$apoint
  File "<stdin>", line 1
    varname = $d$a$t$apoint
              ^
SyntaxError: invalid syntax
>>> varname = asdf$d$a$t$apoint
  File "<stdin>", line 1
    varname = asdf$d$a$t$apoint
                  ^
SyntaxError: invalid syntax
>>> varname = 0o1234987654 + 123
  File "<stdin>", line 1
    varname = 0o1234987654 + 123
                    ^
SyntaxError: invalid syntax
>>> varname = "asdf" + "qwer
  File "<stdin>", line 1
    varname = "asdf" + "qwer
                       ^
SyntaxError: EOL while scanning string literal
msg270933 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-07-21 15:06
Looking at issue 2382, I agree that's a different problem (I'm seeing the current misbehaviour even though everything is consistently encoded as UTF-8).

The main case we're interested in here is the PyUnicode_IsIdentifier one, so if we wanted to do better than "start or end of the token", we could introduce a new internal "_PyUnicode_FindNonIdentifier" that reported the position of the first non-identifier character (or -1 if it's a valid identifier).

Unfortunately, I'm not at all familiar with parsetok.c myself (my own work with the code generator has been from the AST on), so I don't have a ready answer for your other questions.
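The proposed helper is straightforward to express in pure Python (sketch only; the real _PyUnicode_FindNonIdentifier would be written in C in unicodeobject.c, and the name is the proposal's, not an existing API):

```python
def find_non_identifier(s):
    """Return the index of the first character at which *s* stops
    being a valid identifier, or -1 if *s* is a valid identifier."""
    if s.isidentifier():
        return -1
    for i in range(len(s)):
        # The prefix up to and including position i must still be a
        # valid identifier; the first position where it isn't is the
        # character the caret should point at.
        if not s[:i + 1].isidentifier():
            return i
    return 0  # only reachable for the empty string
```

For the motivating example, find_non_identifier('d“a”t”apoint') returns 1, the position of the first curly quote.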
msg270940 - (view) Author: Chris Angelico (Rosuav) * Date: 2016-07-21 16:33
Hmm, that'd be curious. The code to do that is actually pretty simple - see attached patch - but actually using that to affect error messages is a bit harder. Is it safe to mess with tok->start?
msg270941 - (view) Author: Chris Angelico (Rosuav) * Date: 2016-07-21 16:35
Attached is a combined patch that has the new private function for IsIdentifier, method 4's error handling change, and a bit of glue in the middle to make use of it. The result is a passing test suite (bar test_site which was already failing on my system) and a caret that points to the errant Unicode character, or the beginning of the identifier if it's a decode problem.
msg270946 - (view) Author: Stephen J. Turnbull (sjt) * (Python triager) Date: 2016-07-21 17:15
I still think the easiest thing to do would be to make all non-ASCII characters instances of "invalid_character_token", self-delimiting in the same way that operators are.  That would automatically point to exactly the right place in the token stream, and requires zero changes to the error handling code.

I don't have time to look at the code, but I suspect that you could handle this exactly the same way that ? and $ are handled, and maybe even use the same token type.
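A toy lexer illustrates the self-delimiting idea (sketch only, and slightly narrower than stated above: since valid identifiers may legitimately contain non-ASCII letters, the toy only makes *invalid* non-ASCII characters self-delimiting tokens, so the error position falls out of ordinary tokenization):

```python
def toy_tokens(line):
    """Split *line* into identifier runs, single invalid characters,
    and other single characters, recording each token's start column."""
    tokens = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ("a" + ch).isidentifier() or ch.isidentifier():
            # Consume a run of identifier-continuation characters.
            j = i
            while j < len(line) and ("a" + line[j]).isidentifier():
                j += 1
            tokens.append(("name", line[i:j], i))
            i = j
        elif ord(ch) > 127:
            # Self-delimiting invalid-character token: its column is
            # exactly where the caret should point.
            tokens.append(("invalid", ch, i))
            i += 1
        else:
            tokens.append(("op", ch, i))
            i += 1
    return tokens
```

On 'd“a' this yields a name token at column 0, an invalid-character token at column 1, and a name token at column 2, with no changes needed to any error-reporting code.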
msg270986 - (view) Author: Adam Bartoš (Drekin) * Date: 2016-07-22 13:30
Maybe this is related: http://bugs.python.org/issue26152.
msg271138 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-07-24 08:30
Drekin, agreed, that looks like the same problem. Since this one has draft patches attached, I'll mark the other one as a duplicate.
History
Date                 User           Action  Args
2016-07-24 08:33:41  ncoghlan       link    issue26152 superseder
2016-07-24 08:30:51  ncoghlan       set     messages: + msg271138
2016-07-22 13:30:44  Drekin         set     nosy: + Drekin; messages: + msg270986
2016-07-21 17:15:04  sjt            set     nosy: + sjt; messages: + msg270946
2016-07-21 16:35:15  Rosuav         set     files: + combined.patch; messages: + msg270941
2016-07-21 16:33:21  Rosuav         set     files: + where-did-that-error-come-from.patch; messages: + msg270940
2016-07-21 15:06:14  ncoghlan       set     messages: + msg270933
2016-07-21 14:22:41  Rosuav         set     messages: + msg270929
2016-07-21 14:19:36  Rosuav         set     files: + method4-change-all-errors-if-possible.patch
2016-07-21 14:19:22  Rosuav         set     files: + method3-change-all-errors.patch
2016-07-21 14:19:14  Rosuav         set     files: + method2-change-cur-and-inp.patch
2016-07-21 14:19:04  Rosuav         set     files: + method1-change-cur.patch; keywords: + patch; messages: + msg270927
2016-07-21 13:28:41  Rosuav         set     messages: + msg270922
2016-07-21 13:26:46  berker.peksag  set     nosy: + berker.peksag; messages: + msg270921
2016-07-21 11:57:50  Rosuav         set     nosy: + Rosuav
2016-07-21 11:54:14  ncoghlan       create