Message 258722 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	abarnert
Recipients	Drekin, abarnert, ezio.melotti, martin.panter, vstinner
Date	2016-01-20.22:01:43
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1453327304.2.0.254969918971.issue26152@psf.upfronthosting.co.za>
In-reply-to

Content
> Assuming Andrew is correct, it sounds like the tokenizer is treating the NBSP and the “2” as part of the same token, because NBSP is non-ASCII. It's more complicated than that. When you get an invalid character, it splits the token up. So, in this case, you get a separate `ERRORTOKEN` from cols 2-3 and `NUMBER` token from cols 3-4. Even in the case of `1, a\xa0\xa02`, you get a `NAME` token for the `a`, a separate `ERRORTOKEN` for each nbsp, and a `NUMBER` token for the `2`. But I think the code that generates the `SyntaxError` must be trying to re-generate the "intended token" from the broken one. For example: >>> eval('1\xa0\xa02a') File "<string>", line 1 1 2a ^ SyntaxError: invalid character in identifier And if you capture the error and look at it, `e.args[1][1:3]` is 1, 5, which matches what you see. But if you tokenize it (e.g., `list(tokenize.tokenize(io.BytesIO('1\xa0\xa02a'.encode('utf-8')).readline))`, but you'll probably want to wrap that up in a function if you're playing with it a lot...), you get a `NUMBER` from 0-1, an `ERRORTOKEN` from 1-2, another `ERRORTOKEN` from 2-3, a `NUMBER` from 3-4, and a `NAME` from 4-5. So, why does the `SyntaxError` point at the `NAME` instead of the first `ERRORTOKEN`? Presumably there's some logic that tries to work out that the two `ERRORTOKEN`s, `NUMBER`, and `NAME` were all intended to be one big identifier and points at that instead.

> Assuming Andrew is correct, it sounds like the tokenizer is treating the NBSP and the “2” as part of the same token, because NBSP is non-ASCII.

It's more complicated than that. When you get an invalid character, it splits the token up. So, in this case, you get a separate `ERRORTOKEN` from cols 2-3 and `NUMBER` token from cols 3-4. Even in the case of `1, a\xa0\xa02`, you get a `NAME` token for the `a`, a separate `ERRORTOKEN` for each nbsp, and a `NUMBER` token for the `2`.

But I think the code that generates the `SyntaxError` must be trying to re-generate the "intended token" from the broken one. For example:

    >>> eval('1\xa0\xa02a')
    File "<string>", line 1
      1  2a
          ^
    SyntaxError: invalid character in identifier

And if you capture the error and look at it, `e.args[1][1:3]` is 1, 5, which matches what you see.

But if you tokenize it (e.g., `list(tokenize.tokenize(io.BytesIO('1\xa0\xa02a'.encode('utf-8')).readline))`, but you'll probably want to wrap that up in a function if you're playing with it a lot...), you get a `NUMBER` from 0-1, an `ERRORTOKEN` from 1-2, another `ERRORTOKEN` from 2-3, a `NUMBER` from 3-4, and a `NAME` from 4-5. So, why does the `SyntaxError` point at the `NAME` instead of the first `ERRORTOKEN`? Presumably there's some logic that tries to work out that the two `ERRORTOKEN`s, `NUMBER`, and `NAME` were all intended to be one big identifier and points at that instead.

History
Date	User	Action	Args
2016-01-20 22:01:44	abarnert	set	recipients: + abarnert, vstinner, ezio.melotti, martin.panter, Drekin
2016-01-20 22:01:44	abarnert	set	messageid: <1453327304.2.0.254969918971.issue26152@psf.upfronthosting.co.za>
2016-01-20 22:01:44	abarnert	link	issue26152 messages
2016-01-20 22:01:43	abarnert	create