msg270914 - (view) |
Author: Nick Coghlan (ncoghlan) * |
Date: 2016-07-21 11:54 |
Reporting by Rustom Mody on python-ideas, the SyntaxError caret is confusingly mispositioned when an invalid Unicode codepoint appears as part of a larger sequence of invalid codepoints and/or valid identifier characters:
>>> varname = “d“a”t”apoint
File "<stdin>", line 1
varname = “d“a”t”apoint
^
SyntaxError: invalid character in identifier
>>> varname = “d“a”t”apoint.evidence
File "<stdin>", line 1
varname = “d“a”t”apoint.evidence
^
SyntaxError: invalid character in identifier
>>> varname = “d“a”t”apoint[evidence]
File "<stdin>", line 1
varname = “d“a”t”apoint[evidence]
^
SyntaxError: invalid character in identifier
>>> varname = “d“a”t”apoint(evidence)
File "<stdin>", line 1
varname = “d“a”t”apoint(evidence)
^
SyntaxError: invalid character in identifier
If the invalid character is a non-identifiers ASCII character, the error message loses the trailing "in identifier" phrase and points to the correct place:
>>> varname = $d$a$t$apoint
File "<stdin>", line 1
varname = $d$a$t$apoint
^
SyntaxError: invalid syntax
>>> varname = d$a$t$apoint
File "<stdin>", line 1
varname = d$a$t$apoint
^
SyntaxError: invalid syntax
>>> varname = ^d$a$t$apoint
File "<stdin>", line 1
varname = ^d$a$t$apoint
^
SyntaxError: invalid syntax
>>> varname = !d$a$t$apoint
File "<stdin>", line 1
varname = !d$a$t$apoint
^
SyntaxError: invalid syntax
>>> varname = `d$a$t$apoint
File "<stdin>", line 1
varname = `d$a$t$apoint
^
SyntaxError: invalid syntax
|
msg270921 - (view) |
Author: Berker Peksag (berker.peksag) * |
Date: 2016-07-21 13:26 |
This looks like a duplicate of issue 2382.
|
msg270922 - (view) |
Author: Chris Angelico (Rosuav) * |
Date: 2016-07-21 13:28 |
The question was raised that there might be a problem with (UTF-8) bytes vs characters, but that's definitely not it - pythonrun.c:1362 UTF-8-decodes the line of source and then gets its character length to use as the new offset. So I don't think this is a duplicate of 2382.
(Side point: There appears to be quite a bit of complexity inside the CPython parser to cope with the fact that it does everything in UTF-8 bytes rather than simply decoding to text and lexing that. I presume that's for the sake of efficiency - that it'd be too slow to work through PyUnicode everywhere?)
Am looking into the rest.
|
msg270927 - (view) |
Author: Chris Angelico (Rosuav) * |
Date: 2016-07-21 14:19 |
Actually pinpointing the invalid character may be impractical, as there are two boolean situations: either a UnicodeDecodeError (because you had an invalid UTF-8 stream), or PyUnicode_IsIdentifier returns false. Either way, it applies to the whole identifier. So there are a few possibilities, corresponding to the patches I'm attaching.
1) Change the way this one specific error is handled, in tokenizer.c verify_identifier(). If it finds an error, adjust tok->cur to point to the beginning of it. No new failures in test suite.
2) As above, but also change tok->inp, because of this comment in tokenizer.h:31 /* NB If done != E_OK, cur must be == inp!!! */ which I have no idea about the meaning of. This results in truncated error messages, but suggests that method 1 might be breaking an invariant that results in breakage elsewhere. If there are, though, they're not exercised by 'make test', which itself may be a problem. No new test failures.
3) Change the handling of ALL parser errors, in parsetok.c parsetok(), so now they all point to tok->start. Octal literals with 8s or 9s in them now get the caret pointing to the invalid digit, rather than the end of the literal. Unterminated strings point to the opening quote. And some forms of IndentationError now segfault Python. Test suite fails (unsurprisingly).
4) In response to the above segfault, hack it back to the old way of doing things if there's no tok->start. Maybe the condition should be done differently? No new failures in the test suite.
I'd ideally like to use the technique from method 3 (either as patch 4 or with some other guard condition). Failing that, can anyone explain the "NB" above, and what ought to be done to comply with it?
|
msg270929 - (view) |
Author: Chris Angelico (Rosuav) * |
Date: 2016-07-21 14:22 |
BTW, here's how a session looks using method 4's change:
>>> varname = asdf“d“a”t”apoint
File "<stdin>", line 1
varname = asdf“d“a”t”apoint
^
SyntaxError: invalid character in identifier
>>> varname = asdf“d“a”t”apoint.evidence
File "<stdin>", line 1
varname = asdf“d“a”t”apoint.evidence
^
SyntaxError: invalid character in identifier
>>> varname = $d$a$t$apoint
File "<stdin>", line 1
varname = $d$a$t$apoint
^
SyntaxError: invalid syntax
>>> varname = asdf$d$a$t$apoint
File "<stdin>", line 1
varname = asdf$d$a$t$apoint
^
SyntaxError: invalid syntax
>>> varname = 0o1234987654 + 123
File "<stdin>", line 1
varname = 0o1234987654 + 123
^
SyntaxError: invalid syntax
>>> varname = "asdf" + "qwer
File "<stdin>", line 1
varname = "asdf" + "qwer
^
SyntaxError: EOL while scanning string literal
|
msg270933 - (view) |
Author: Nick Coghlan (ncoghlan) * |
Date: 2016-07-21 15:06 |
Looking at issue 2382, I agree that's a different problem (I'm seeing the current misbehaviour even though everything is consistently encoded as UTF-8)
The main case we're interested in here is the PyUnicode_IsIdentifier one, so if we wanted to do better than "start or end of the token", we could introduce a new internal "_PyUnicode_FindNonIdentifier" that reported the position of the first non-identifier character (or -1 if it's a valid identifier).
Unfortunately, I'm not at all familiar with parsetok.c myself (my own work with the code generator has been from the AST on), so I don't have a ready answer for your other questions.
|
msg270940 - (view) |
Author: Chris Angelico (Rosuav) * |
Date: 2016-07-21 16:33 |
Hmm, that'd be curious. The code to do that is actually pretty simple - see attached patch - but actually using that to affect error messages is a bit harder. Is it safe to mess with tok->start?
|
msg270941 - (view) |
Author: Chris Angelico (Rosuav) * |
Date: 2016-07-21 16:35 |
Attached is a combined patch that has the new private function for IsIdentifier, method 4's error handling change, and a bit of glue in the middle to make use of it. The result is a passing test suite (bar test_site which was already failing on my system) and a caret that points to the errant Unicode character, or the beginning of the identifier if it's a decode problem.
|
msg270946 - (view) |
Author: Stephen J. Turnbull (sjt) * |
Date: 2016-07-21 17:15 |
I still think the easiest thing to do would be to make all non-ASCII characters instances of "invalid_character_token", self-delimiting in the same way that operators are. That would automatically point to exactly the right place in the token stream, and requires zero changes to the error handling code.
I don't have time to look at the code, but I suspect that you could handle this exactly the same way that ? and $ are handled, and maybe even use the same token type.
|
msg270986 - (view) |
Author: Adam Bartoš (Drekin) * |
Date: 2016-07-22 13:30 |
Maybe this is related: http://bugs.python.org/issue26152.
|
msg271138 - (view) |
Author: Nick Coghlan (ncoghlan) * |
Date: 2016-07-24 08:30 |
Drekin, agreed, that looks like the same problem. Since this one has draft patches attached, I'll mark the other one as a duplicate.
|
msg406869 - (view) |
Author: Irit Katriel (iritkatriel) * |
Date: 2021-11-23 17:42 |
This seems to have been fixed by now, I get this on 3.11:
>>> varname = “d“a”t”apoint
File "<stdin>", line 1
varname = “d“a”t”apoint
^
SyntaxError: invalid character '“' (U+201C)
>>> varname = “d“a”t”apoint.evidence
File "<stdin>", line 1
varname = “d“a”t”apoint.evidence
^
SyntaxError: invalid character '“' (U+201C)
>>> varname = “d“a”t”apoint[evidence]
File "<stdin>", line 1
varname = “d“a”t”apoint[evidence]
^
SyntaxError: invalid character '“' (U+201C)
>>> varname = “d“a”t”apoint(evidence)
File "<stdin>", line 1
varname = “d“a”t”apoint(evidence)
^
SyntaxError: invalid character '“' (U+201C)
|
|
Date |
User |
Action |
Args |
2022-04-11 14:58:34 | admin | set | github: 71769 |
2021-12-03 17:16:24 | iritkatriel | set | status: pending -> closed stage: needs patch -> resolved |
2021-11-23 17:42:44 | iritkatriel | set | status: open -> pending
nosy:
+ iritkatriel messages:
+ msg406869
resolution: fixed |
2016-07-24 08:33:41 | ncoghlan | link | issue26152 superseder |
2016-07-24 08:30:51 | ncoghlan | set | messages:
+ msg271138 |
2016-07-22 13:30:44 | Drekin | set | nosy:
+ Drekin messages:
+ msg270986
|
2016-07-21 17:15:04 | sjt | set | nosy:
+ sjt messages:
+ msg270946
|
2016-07-21 16:35:15 | Rosuav | set | files:
+ combined.patch
messages:
+ msg270941 |
2016-07-21 16:33:21 | Rosuav | set | files:
+ where-did-that-error-come-from.patch
messages:
+ msg270940 |
2016-07-21 15:06:14 | ncoghlan | set | messages:
+ msg270933 |
2016-07-21 14:22:41 | Rosuav | set | messages:
+ msg270929 |
2016-07-21 14:19:36 | Rosuav | set | files:
+ method4-change-all-errors-if-possible.patch |
2016-07-21 14:19:22 | Rosuav | set | files:
+ method3-change-all-errors.patch |
2016-07-21 14:19:14 | Rosuav | set | files:
+ method2-change-cur-and-inp.patch |
2016-07-21 14:19:04 | Rosuav | set | files:
+ method1-change-cur.patch keywords:
+ patch messages:
+ msg270927
|
2016-07-21 13:28:41 | Rosuav | set | messages:
+ msg270922 |
2016-07-21 13:26:46 | berker.peksag | set | nosy:
+ berker.peksag messages:
+ msg270921
|
2016-07-21 11:57:50 | Rosuav | set | nosy:
+ Rosuav
|
2016-07-21 11:54:14 | ncoghlan | create | |