Author abarnert
Recipients Drekin, abarnert, ezio.melotti, martin.panter, vstinner
Date 2016-01-20.22:01:43
Message-id <1453327304.2.0.254969918971.issue26152@psf.upfronthosting.co.za>
In-reply-to
Content
> Assuming Andrew is correct, it sounds like the tokenizer is treating the NBSP and the “2” as part of the same token, because NBSP is non-ASCII.

It's more complicated than that. When you get an invalid character, it splits the token up. So, in this case, you get a separate `ERRORTOKEN` from cols 2-3 and a `NUMBER` token from cols 3-4. Even in the case of `1, a\xa0\xa02`, you get a `NAME` token for the `a`, a separate `ERRORTOKEN` for each NBSP, and a `NUMBER` token for the `2`.

But I think the code that generates the `SyntaxError` must be trying to re-generate the "intended token" from the broken one. For example:

    >>> eval('1\xa0\xa02a')
    File "<string>", line 1
      1  2a
          ^
    SyntaxError: invalid character in identifier

And if you capture the error and look at it, `e.args[1][1:3]` is `(1, 5)`, i.e. line 1, offset 5, which matches what you see.
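For anyone who wants to poke at this, here is a minimal sketch of capturing the error and pulling out the location. The helper name `syntax_error_location` is made up for illustration, and it uses `compile(..., 'eval')` as a stand-in for `eval` so nothing is executed. The exact message and offset depend on the Python version, so don't expect every interpreter to report the same values as the traceback above.

```python
def syntax_error_location(source):
    """Return (lineno, offset, msg) if compiling `source` fails, else None.

    Hypothetical helper for experimenting; not part of any library.
    """
    try:
        compile(source, '<string>', 'eval')
    except SyntaxError as e:
        # e.lineno and e.offset carry the same values as e.args[1][1:3].
        return e.lineno, e.offset, e.msg
    return None


if __name__ == '__main__':
    # Output varies between Python versions.
    print(syntax_error_location('1\xa0\xa02a'))
```

The `lineno` and `offset` attributes are the documented way to read the same values that `e.args[1][1:3]` exposes.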

But if you tokenize it (e.g., with `list(tokenize.tokenize(io.BytesIO('1\xa0\xa02a'.encode('utf-8')).readline))`, though you'll probably want to wrap that up in a function if you're playing with it a lot), you get a `NUMBER` from 0-1, an `ERRORTOKEN` from 1-2, another `ERRORTOKEN` from 2-3, a `NUMBER` from 3-4, and a `NAME` from 4-5. So why does the `SyntaxError` point at the `NAME` instead of the first `ERRORTOKEN`? Presumably there's some logic that tries to work out that the two `ERRORTOKEN`s, the `NUMBER`, and the `NAME` were all intended to be one big identifier, and points at that instead.
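The tokenize one-liner can be wrapped up roughly like this. It's only a sketch: `dump_tokens` is a name I made up, and the `EXCEPTION` row is my own convention for reporting a tokenizer that raises instead of emitting `ERRORTOKEN`s, which some Python versions may do for this input.

```python
import io
import tokenize


def dump_tokens(source):
    """Return (type name, string, start col, end col) for each token.

    Hypothetical helper for experimenting; not part of the tokenize module.
    """
    readline = io.BytesIO(source.encode('utf-8')).readline
    rows = []
    try:
        for tok in tokenize.tokenize(readline):
            name = tokenize.tok_name[tok.type]
            rows.append((name, tok.string, tok.start[1], tok.end[1]))
    except (tokenize.TokenError, SyntaxError) as exc:
        # Record the failure rather than losing the tokens seen so far.
        rows.append(('EXCEPTION', str(exc), None, None))
    return rows


if __name__ == '__main__':
    for row in dump_tokens('1\xa0\xa02a'):
        print(row)
```

The first row is always the `ENCODING` token that `tokenize.tokenize` emits, so the interesting tokens start at index 1.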