classification
Title: Mispositioned SyntaxError caret for unknown code points
Type: behavior Stage: needs patch
Components: Versions: Python 3.6, Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Drekin, Rosuav, berker.peksag, ncoghlan, sjt
Priority: normal Keywords: patch

Created on 2016-07-21 11:54 by ncoghlan, last changed 2016-07-24 08:30 by ncoghlan.

Files
File name Uploaded Description Edit
method1-change-cur.patch Rosuav, 2016-07-21 14:19 review
method2-change-cur-and-inp.patch Rosuav, 2016-07-21 14:19 review
method3-change-all-errors.patch Rosuav, 2016-07-21 14:19 review
method4-change-all-errors-if-possible.patch Rosuav, 2016-07-21 14:19 review
where-did-that-error-come-from.patch Rosuav, 2016-07-21 16:33 review
combined.patch Rosuav, 2016-07-21 16:35 review
Messages (11)
msg270914 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-07-21 11:54
As reported by Rustom Mody on python-ideas, the SyntaxError caret is confusingly mispositioned when an invalid Unicode code point appears as part of a larger sequence of invalid code points and/or valid identifier characters:

>>> varname = “d“a”t”apoint
  File "<stdin>", line 1
    varname = “d“a”t”apoint
                          ^
SyntaxError: invalid character in identifier
>>> varname = “d“a”t”apoint.evidence
  File "<stdin>", line 1
    varname = “d“a”t”apoint.evidence
                          ^
SyntaxError: invalid character in identifier
>>> varname = “d“a”t”apoint[evidence]
  File "<stdin>", line 1
    varname = “d“a”t”apoint[evidence]
                          ^
SyntaxError: invalid character in identifier
>>> varname = “d“a”t”apoint(evidence)
  File "<stdin>", line 1
    varname = “d“a”t”apoint(evidence)
                          ^
SyntaxError: invalid character in identifier

If the invalid character is a non-identifier ASCII character, the error message loses the trailing "in identifier" phrase and points to the correct place:

>>> varname = $d$a$t$apoint
  File "<stdin>", line 1
    varname = $d$a$t$apoint
              ^
SyntaxError: invalid syntax
>>> varname = d$a$t$apoint
  File "<stdin>", line 1
    varname = d$a$t$apoint
               ^
SyntaxError: invalid syntax
>>> varname = ^d$a$t$apoint
  File "<stdin>", line 1
    varname = ^d$a$t$apoint
              ^
SyntaxError: invalid syntax
>>> varname = !d$a$t$apoint
  File "<stdin>", line 1
    varname = !d$a$t$apoint
              ^
SyntaxError: invalid syntax
>>> varname = `d$a$t$apoint
  File "<stdin>", line 1
    varname = `d$a$t$apoint
              ^
SyntaxError: invalid syntax
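The same misbehaviour can be observed programmatically rather than at the REPL, by compiling the offending line and inspecting the SyntaxError attributes (a sketch for illustration only, not part of any patch):

```python
# Compile the reported line and look at where the error claims to be.
source = 'varname = “d“a”t”apoint'

try:
    compile(source, "<test>", "exec")
except SyntaxError as err:
    # err.offset is a 1-based column index into err.text; on the
    # affected versions it points near the end of the identifier
    # rather than at the first invalid quote character.
    print(err.msg)
    print(err.offset)
```

The exact offset printed depends on the interpreter version, which is precisely the behaviour under discussion.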
msg270921 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2016-07-21 13:26
This looks like a duplicate of issue 2382.
msg270922 - (view) Author: Chris Angelico (Rosuav) * Date: 2016-07-21 13:28
The question was raised that there might be a problem with (UTF-8) bytes vs characters, but that's definitely not it - pythonrun.c:1362 UTF-8-decodes the line of source and then gets its character length to use as the new offset. So I don't think this is a duplicate of 2382.

(Side point: There appears to be quite a bit of complexity inside the CPython parser to cope with the fact that it does everything in UTF-8 bytes rather than simply decoding to text and lexing that. I presume that's for the sake of efficiency - that it'd be too slow to work through PyUnicode everywhere?)

Am looking into the rest.
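A quick sanity check of the bytes-vs-characters question (sketch only): the test line is longer in UTF-8 bytes than in characters, and the reported offset stays within the character length of the line, so the offset is character-based as described above.

```python
source = 'varname = “d“a”t”apoint'

print(len(source))                  # 23 characters
print(len(source.encode("utf-8")))  # 31 bytes - the curly quotes are 3 bytes each

try:
    compile(source, "<test>", "exec")
except SyntaxError as err:
    # A byte-based offset could exceed the 23-character line length;
    # the reported one does not.
    print(err.offset)
```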
msg270927 - (view) Author: Chris Angelico (Rosuav) * Date: 2016-07-21 14:19
Actually pinpointing the invalid character may be impractical, as there are two all-or-nothing failure modes: either a UnicodeDecodeError (because the source wasn't a valid UTF-8 stream), or PyUnicode_IsIdentifier returns false. Either way, the error applies to the whole identifier, not to one character within it. So there are a few possibilities, corresponding to the patches I'm attaching.

1) Change the way this one specific error is handled, in tokenizer.c verify_identifier(). If it finds an error, adjust tok->cur to point to the beginning of it. No new failures in test suite.

2) As above, but also change tok->inp, because of this comment in tokenizer.h:31 /* NB If done != E_OK, cur must be == inp!!! */ whose meaning I don't understand. This results in truncated error messages, but the comment suggests that method 1 may be breaking an invariant and causing breakage elsewhere. If there is such breakage, though, it isn't exercised by 'make test', which may itself be a problem. No new test failures.

3) Change the handling of ALL parser errors, in parsetok.c parsetok(), so now they all point to tok->start. Octal literals with 8s or 9s in them now get the caret pointing to the invalid digit, rather than the end of the literal. Unterminated strings point to the opening quote. And some forms of IndentationError now segfault Python. Test suite fails (unsurprisingly).

4) In response to the above segfault, hack it back to the old way of doing things if there's no tok->start. Maybe the condition should be done differently? No new failures in the test suite.

I'd ideally like to use the technique from method 3 (either as patch 4 or with some other guard condition). Failing that, can anyone explain the "NB" above, and what ought to be done to comply with it?
msg270929 - (view) Author: Chris Angelico (Rosuav) * Date: 2016-07-21 14:22
BTW, here's how a session looks using method 4's change:

>>> varname = asdf“d“a”t”apoint
  File "<stdin>", line 1
    varname = asdf“d“a”t”apoint
              ^
SyntaxError: invalid character in identifier
>>> varname = asdf“d“a”t”apoint.evidence
  File "<stdin>", line 1
    varname = asdf“d“a”t”apoint.evidence
              ^
SyntaxError: invalid character in identifier
>>> varname = $d$a$t$apoint
  File "<stdin>", line 1
    varname = $d$a$t$apoint
              ^
SyntaxError: invalid syntax
>>> varname = asdf$d$a$t$apoint
  File "<stdin>", line 1
    varname = asdf$d$a$t$apoint
                  ^
SyntaxError: invalid syntax
>>> varname = 0o1234987654 + 123
  File "<stdin>", line 1
    varname = 0o1234987654 + 123
                    ^
SyntaxError: invalid syntax
>>> varname = "asdf" + "qwer
  File "<stdin>", line 1
    varname = "asdf" + "qwer
                       ^
SyntaxError: EOL while scanning string literal
msg270933 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-07-21 15:06
Looking at issue 2382, I agree that's a different problem (I'm seeing the current misbehaviour even though everything is consistently encoded as UTF-8).

The main case we're interested in here is the PyUnicode_IsIdentifier one, so if we wanted to do better than "start or end of the token", we could introduce a new internal "_PyUnicode_FindNonIdentifier" that reported the position of the first non-identifier character (or -1 if it's a valid identifier).

Unfortunately, I'm not at all familiar with parsetok.c myself (my own work with the code generator has been from the AST on), so I don't have a ready answer for your other questions.
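The proposed helper is straightforward to express in pure Python (sketch only; the real _PyUnicode_FindNonIdentifier would be written in C in unicodeobject.c, and the name is the proposal's, not an existing API):

```python
def find_non_identifier(s):
    """Return the index of the first character at which *s* stops
    being a valid identifier, or -1 if *s* is a valid identifier."""
    if s.isidentifier():
        return -1
    for i in range(len(s)):
        # The prefix up to and including position i must still be a
        # valid identifier; the first position where it isn't is the
        # character the caret should point at.
        if not s[:i + 1].isidentifier():
            return i
    return 0  # only reachable for the empty string
```

For the motivating example, find_non_identifier('d“a”t”apoint') returns 1, the position of the first curly quote.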
msg270940 - (view) Author: Chris Angelico (Rosuav) * Date: 2016-07-21 16:33
Hmm, that'd be curious. The code to do that is actually pretty simple - see attached patch - but actually using that to affect error messages is a bit harder. Is it safe to mess with tok->start?
msg270941 - (view) Author: Chris Angelico (Rosuav) * Date: 2016-07-21 16:35
Attached is a combined patch that has the new private function for IsIdentifier, method 4's error handling change, and a bit of glue in the middle to make use of it. The result is a passing test suite (bar test_site which was already failing on my system) and a caret that points to the errant Unicode character, or the beginning of the identifier if it's a decode problem.
msg270946 - (view) Author: Stephen J. Turnbull (sjt) * (Python triager) Date: 2016-07-21 17:15
I still think the easiest thing to do would be to make all non-ASCII characters instances of "invalid_character_token", self-delimiting in the same way that operators are.  That would automatically point to exactly the right place in the token stream, and requires zero changes to the error handling code.

I don't have time to look at the code, but I suspect that you could handle this exactly the same way that ? and $ are handled, and maybe even use the same token type.
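A toy lexer illustrates the self-delimiting idea (sketch only, and slightly narrower than stated above: since valid identifiers may legitimately contain non-ASCII letters, the toy only makes *invalid* non-ASCII characters self-delimiting tokens, so the error position falls out of ordinary tokenization):

```python
def toy_tokens(line):
    """Split *line* into identifier runs, single invalid characters,
    and other single characters, recording each token's start column."""
    tokens = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ("a" + ch).isidentifier() or ch.isidentifier():
            # Consume a run of identifier-continuation characters.
            j = i
            while j < len(line) and ("a" + line[j]).isidentifier():
                j += 1
            tokens.append(("name", line[i:j], i))
            i = j
        elif ord(ch) > 127:
            # Self-delimiting invalid-character token: its column is
            # exactly where the caret should point.
            tokens.append(("invalid", ch, i))
            i += 1
        else:
            tokens.append(("op", ch, i))
            i += 1
    return tokens
```

On 'd“a' this yields a name token at column 0, an invalid-character token at column 1, and a name token at column 2, with no changes needed to any error-reporting code.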
msg270986 - (view) Author: Adam Bartoš (Drekin) * Date: 2016-07-22 13:30
Maybe this is related: http://bugs.python.org/issue26152.
msg271138 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-07-24 08:30
Drekin, agreed, that looks like the same problem. Since this one has draft patches attached, I'll mark the other one as a duplicate.
History
Date                 User           Action  Args
2016-07-24 08:33:41  ncoghlan       link    issue26152 superseder
2016-07-24 08:30:51  ncoghlan       set     messages: + msg271138
2016-07-22 13:30:44  Drekin         set     nosy: + Drekin; messages: + msg270986
2016-07-21 17:15:04  sjt            set     nosy: + sjt; messages: + msg270946
2016-07-21 16:35:15  Rosuav         set     files: + combined.patch; messages: + msg270941
2016-07-21 16:33:21  Rosuav         set     files: + where-did-that-error-come-from.patch; messages: + msg270940
2016-07-21 15:06:14  ncoghlan       set     messages: + msg270933
2016-07-21 14:22:41  Rosuav         set     messages: + msg270929
2016-07-21 14:19:36  Rosuav         set     files: + method4-change-all-errors-if-possible.patch
2016-07-21 14:19:22  Rosuav         set     files: + method3-change-all-errors.patch
2016-07-21 14:19:14  Rosuav         set     files: + method2-change-cur-and-inp.patch
2016-07-21 14:19:04  Rosuav         set     files: + method1-change-cur.patch; keywords: + patch; messages: + msg270927
2016-07-21 13:28:41  Rosuav         set     messages: + msg270922
2016-07-21 13:26:46  berker.peksag  set     nosy: + berker.peksag; messages: + msg270921
2016-07-21 11:57:50  Rosuav         set     nosy: + Rosuav
2016-07-21 11:54:14  ncoghlan       create