Message258616
Ultimately, this is because the tokenizer works byte by byte rather than character by character wherever possible. Since any byte >= 128 must be part of some non-ASCII character, and the only legal use of non-ASCII characters outside of quotes and comments is as part of an identifier, the tokenizer assumes (see the macros at the top of tokenizer.c, and the top of the again block in tok_get) that any byte >= 128 is part of an identifier, and then validates the whole string with PyUnicode_IsIdentifier at the end.
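For illustration, the same check that PyUnicode_IsIdentifier performs is exposed at the Python level as str.isidentifier(), so you can see which strings would pass the tokenizer's final validation:

```python
# str.isidentifier() performs the same test as PyUnicode_IsIdentifier:
# first character must be XID_Start (or '_'), the rest XID_Continue.
print("hello".isidentifier())     # True: plain ASCII letters
print("héllo".isidentifier())     # True: é (U+00E9) is in XID_Continue
print("a\u00a0b".isidentifier())  # False: U+00A0 (no-break space) is not
```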
This actually gives a better error for more visible glyphs, especially ones that look letter-like but are not in XID_Continue, but it is somewhat odd for a few, such as the no-break space.
If this needs to be fixed, I think the simplest approach is a special case: if the first invalid-identifier character is in Unicode category Z (separator), report an error about invalid whitespace instead of an invalid identifier character. (This would probably require adding a PyUnicode_CheckIdentifier that, instead of just returning 0 for failure as PyUnicode_IsIdentifier does, returns -n when the first non-identifier character has code point n.)
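A Python-level sketch of that proposal (the name check_identifier and its return convention are hypothetical, taken from the suggestion above; this is not an actual CPython API):

```python
import unicodedata

def check_identifier(s):
    """Sketch of the proposed PyUnicode_CheckIdentifier: return 1 if s
    is a valid identifier, 0 if s is empty, otherwise -n where n is the
    code point of the first non-identifier character."""
    if not s:
        return 0
    for i, ch in enumerate(s):
        if i == 0:
            ok = ch.isidentifier()          # XID_Start or '_'
        else:
            ok = ("_" + ch).isidentifier()  # XID_Continue
        if not ok:
            return -ord(ch)
    return 1

def describe_error(s):
    """Pick the error message: category Z (separators) gets the
    whitespace-specific wording proposed in the message."""
    res = check_identifier(s)
    if res >= 0:
        return None
    cp = -res
    if unicodedata.category(chr(cp)).startswith("Z"):
        return "invalid whitespace character U+%04X in identifier" % cp
    return "invalid character U+%04X in identifier" % cp

print(describe_error("a\u00a0b"))  # U+00A0 is category Zs
```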
Date                | User     | Action | Args
2016-01-19 18:53:52 | abarnert | set    | recipients: + abarnert, vstinner, ezio.melotti, Drekin
2016-01-19 18:53:52 | abarnert | set    | messageid: <1453229632.22.0.311638428984.issue26152@psf.upfronthosting.co.za>
2016-01-19 18:53:52 | abarnert | link   | issue26152 messages
2016-01-19 18:53:51 | abarnert | create |