Title: A non-breaking space in a source
Type: behavior Stage:
Components: Interpreter Core, Unicode Versions: Python 3.6
Status: closed Resolution: duplicate
Dependencies: Superseder: Mispositioned SyntaxError caret for unknown code points
View: 27582
Assigned To: Nosy List: Drekin, abarnert, ezio.melotti, martin.panter, ncoghlan, vstinner
Priority: normal Keywords:

Created on 2016-01-19 12:01 by Drekin, last changed 2016-07-24 08:33 by ncoghlan. This issue is now closed.

Messages (9)
msg258584 - (view) Author: Adam Bartoš (Drekin) * Date: 2016-01-19 12:01
Consider the following code:
>>> 1, 2
  File "<stdin>", line 1
    1, 2
SyntaxError: invalid character in identifier

The error occurs because the space before "2" is actually a non-breaking space. Both the error message and the position of the caret are misleading.

The tokenize module gives an ERRORTOKEN at the position of the space, so shouldn't the message be more like "invalid syntax" with the correct position, or even something more appropriate?
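Since a non-breaking space looks like an ordinary space in most terminals, it can help to list the non-ASCII characters in the offending line. The following is only a minimal sketch: `find_non_ascii` is a hypothetical helper, not something from the report or the standard library.

```python
import unicodedata


def find_non_ascii(line):
    """Return (column, code point, name, category) for each non-ASCII character."""
    found = []
    for col, ch in enumerate(line):
        if ord(ch) > 127:
            name = unicodedata.name(ch, "<unnamed>")
            found.append((col, "U+%04X" % ord(ch), name, unicodedata.category(ch)))
    return found


# The source line from the report, with the non-breaking space written as an escape.
print(find_non_ascii("1,\xa02"))
```

The category reported for the non-breaking space is `Zs` (space separator), which is what makes it easy to mistake for a regular space.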
msg258616 - (view) Author: Andrew Barnert (abarnert) * Date: 2016-01-19 18:53
Ultimately, this is because the tokenizer works byte by byte instead of character by character, as far as possible. Since any byte >= 128 must be part of some non-ASCII character, and the only legal use for non-ASCII characters outside of quotes and comments is as part of an identifier, the tokenizer assumes (see the macros at the top of tokenizer.c, and the top of the again block in tok_get) that any byte >= 128 is part of an identifier, and then checks the whole string with PyUnicode_IsIdentifier at the end.

This actually gives a better error for more visible glyphs, especially ones that look letter-like but aren't in XID_Continue, but it is kind of weird for a few, like non-break space.

If this needs to be fixed, I think the simplest thing is to special-case things: if the first non-valid-identifier character is in category Z, set an error about invalid whitespace instead of invalid identifier character. (This would probably require adding a PyUnicode_CheckIdentifier that, instead of just returning 0 for failure as PyUnicode_IsIdentifier does, returns -n for a non-identifier character with code point n.)
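The special-casing idea can be sketched in pure Python. This is an illustration only: `check_identifier` and `describe_error` are hypothetical names, the wording of the whitespace message is invented, and `str.isidentifier` stands in for the C-level check.

```python
import unicodedata


def check_identifier(s):
    """Hypothetical analogue of the proposed PyUnicode_CheckIdentifier.

    Returns 1 for a valid identifier, 0 for an empty string, and -n when the
    first offending character has code point n.
    """
    if not s:
        return 0
    for i in range(len(s)):
        # The shortest prefix that fails ends at the first offending character.
        if not s[: i + 1].isidentifier():
            return -ord(s[i])
    return 1


def describe_error(s):
    """Pick an error message, special-casing separator (category Z) characters."""
    result = check_identifier(s)
    if result > 0:
        return None
    if result < 0 and unicodedata.category(chr(-result)).startswith("Z"):
        return "invalid whitespace character U+%04X" % -result
    return "invalid character in identifier"


# "\xa02" is the run the tokenizer would treat as one candidate identifier.
print(describe_error("\xa02"))
```

With this split, a letter-like but invalid glyph still gets the identifier message, while anything in category Z gets the whitespace message.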
msg258679 - (view) Author: Adam Bartoš (Drekin) * Date: 2016-01-20 13:05
That explains the message. But why is the caret at a wrong place?
msg258711 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-01-20 20:05
The caret always points to the end of the token, I think.
msg258713 - (view) Author: Adam Bartoš (Drekin) * Date: 2016-01-20 20:31
We have one particular invalid token, so why should it point to the next token rather than to the invalid one?
msg258714 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-01-20 20:40
Assuming Andrew is correct, it sounds like the tokenizer is treating the NBSP and the “2” as part of the same token, because NBSP is non-ASCII.
msg258715 - (view) Author: Adam Bartoš (Drekin) * Date: 2016-01-20 20:48
It could still point to the first or the last byte of the invalid token rather than to the start of the next token. Also, with the Python implementation of the tokenizer in the tokenize module, we get an ERRORTOKEN containing a non-breaking space followed by a NUMBER token containing 2.
msg258722 - (view) Author: Andrew Barnert (abarnert) * Date: 2016-01-20 22:01
> Assuming Andrew is correct, it sounds like the tokenizer is treating the NBSP and the “2” as part of the same token, because NBSP is non-ASCII.

It's more complicated than that. When you get an invalid character, it splits the token up. So, in this case, you get a separate `ERRORTOKEN` from cols 2-3 and a `NUMBER` token from cols 3-4. Even in the case of `1, a\xa0\xa02`, you get a `NAME` token for the `a`, a separate `ERRORTOKEN` for each nbsp, and a `NUMBER` token for the `2`.

But I think the code that generates the `SyntaxError` must be trying to re-generate the "intended token" from the broken one. For example:

    >>> eval('1\xa0\xa02a')
    File "<string>", line 1
      1  2a
    SyntaxError: invalid character in identifier

And if you capture the error and look at it, `e.args[1][1:3]` is 1, 5, which matches what you see.
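Capturing the error might look like the sketch below. `syntax_error_location` is a hypothetical helper, and `compile` is used instead of `eval` so nothing is executed. The message text and the reported offset are assumptions that depend on the interpreter version, so the sketch prints them rather than expecting particular values.

```python
def syntax_error_location(source):
    """Compile source and return details of the SyntaxError, or None if it compiles."""
    try:
        compile(source, "<string>", "eval")
    except SyntaxError as e:
        # e.args is (msg, (filename, lineno, offset, text, ...)),
        # so e.args[1][1:3] is the (lineno, offset) pair.
        return {"msg": e.msg, "location": e.args[1][1:3], "text": e.text}
    return None


print(syntax_error_location("1\xa0\xa02a"))
```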

But if you tokenize it (e.g., `list(tokenize.tokenize(io.BytesIO('1\xa0\xa02a'.encode('utf-8')).readline))`, but you'll probably want to wrap that up in a function if you're playing with it a lot...), you get a `NUMBER` from 0-1, an `ERRORTOKEN` from 1-2, another `ERRORTOKEN` from 2-3, a `NUMBER` from 3-4, and a `NAME` from 4-5. So, why does the `SyntaxError` point at the `NAME` instead of the first `ERRORTOKEN`? Presumably there's some logic that tries to work out that the two `ERRORTOKEN`s, `NUMBER`, and `NAME` were all intended to be one big identifier and points at that instead.
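One way to wrap that up in a function is sketched here. `tokenize_source` is a hypothetical name, and the error handling rests on an assumption: interpreters newer than the one discussed in this issue may raise an exception for the non-breaking spaces instead of yielding `ERRORTOKEN` entries, so the sketch returns whatever tokens were produced along with any error.

```python
import io
import tokenize


def tokenize_source(source):
    """Return ([(type name, string, start col, end col), ...], error or None)."""
    readline = io.BytesIO(source.encode("utf-8")).readline
    tokens = []
    error = None
    try:
        for tok in tokenize.tokenize(readline):
            tokens.append(
                (tokenize.tok_name[tok.type], tok.string, tok.start[1], tok.end[1])
            )
    except (tokenize.TokenError, SyntaxError) as exc:
        # Assumption: some Python versions raise here rather than
        # emitting ERRORTOKEN for the invalid characters.
        error = exc
    return tokens, error


tokens, error = tokenize_source("1\xa0\xa02a")
for entry in tokens:
    print(entry)
print("error:", error)
```

Comparing the printed columns with the offset in the `SyntaxError` makes the mismatch described above easy to see.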
msg271139 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-07-24 08:33
The superseding issue, "Mispositioned SyntaxError caret for unknown code points", is a later mention of the same problem that attracted patches before Adam noticed it was a repeat of this issue.

Marking this as the duplicate, since the problem applies to more than just Unicode whitespace, and the problems being discussed there should also help with this subproblem.
Date                 User           Action  Args
2016-07-24 08:33:56  ncoghlan       set     status: open -> closed; resolution: duplicate
2016-07-24 08:33:41  ncoghlan       set     superseder: Mispositioned SyntaxError caret for unknown code points; messages: + msg271139; nosy: + ncoghlan
2016-01-20 22:01:44  abarnert       set     messages: + msg258722
2016-01-20 20:48:07  Drekin         set     messages: + msg258715
2016-01-20 20:40:13  martin.panter  set     messages: + msg258714
2016-01-20 20:31:52  Drekin         set     messages: + msg258713
2016-01-20 20:05:58  martin.panter  set     nosy: + martin.panter; messages: + msg258711
2016-01-20 13:05:51  Drekin         set     messages: + msg258679
2016-01-19 18:53:52  abarnert       set     nosy: + abarnert; messages: + msg258616
2016-01-19 12:01:37  Drekin         create