Author abarnert
Recipients Drekin, abarnert, ezio.melotti, vstinner
Date 2016-01-19.18:53:51
Message-id <1453229632.22.0.311638428984.issue26152@psf.upfronthosting.co.za>
In-reply-to
Content
Ultimately, this is because the tokenizer works byte by byte rather than character by character wherever it can. Any byte >= 128 must be part of some non-ASCII character, and the only legal use for non-ASCII characters outside of string literals and comments is as part of an identifier. So the tokenizer assumes (see the macros at the top of tokenizer.c, and the top of the again block in tok_get) that any byte >= 128 is part of an identifier, and then checks the whole string with PyUnicode_IsIdentifier at the end.
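
To make that concrete, here is a rough Python model of the scan described above. It is only a sketch, not the code in tokenizer.c: the function name scan_identifier_candidate is made up for illustration, it assumes the source is valid UTF-8, and it ignores that the real tokenizer sends a leading digit down the number path.

```python
from typing import Tuple


def scan_identifier_candidate(source: bytes, start: int) -> Tuple[str, bool]:
    """Hypothetical model: consume ASCII identifier bytes plus any byte >= 128,
    then validate the whole run once at the end."""
    end = start
    while end < len(source):
        b = source[end]
        is_ascii_ident = (
            b == ord("_")
            or ord("a") <= b <= ord("z")
            or ord("A") <= b <= ord("Z")
            or ord("0") <= b <= ord("9")
        )
        # Any byte >= 128 is assumed to belong to the identifier.
        if not (is_ascii_ident or b >= 128):
            break
        end += 1
    # The run always ends on a character boundary, because every
    # continuation byte of a multi-byte character is >= 128.
    text = source[start:end].decode("utf-8")
    # str.isidentifier() stands in for PyUnicode_IsIdentifier here.
    return text, text.isidentifier()
```

With this model, the bytes for "x", a no-break space, and " = 1" scan as the two-character run "x\u00a0", which then fails the identifier check as a whole.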

This actually gives a better error for more visible glyphs, especially ones that look letter-like but aren't in XID_Continue, but it is kind of weird for a few, like no-break space (U+00A0), where the error talks about an identifier even though the character looks like whitespace.

If this needs to be fixed, I think the simplest thing is to special-case things: if the first non-valid-identifier character is in Unicode general category Z (the separators), set an error about invalid whitespace instead of invalid identifier character. (This would probably require adding a PyUnicode_CheckIdentifier that, instead of just returning 0 for failure as PyUnicode_IsIdentifier does, returns -n for a non-identifier character with code point n.)
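
A minimal Python sketch of that idea, under stated assumptions: check_identifier is a hypothetical stand-in for the proposed PyUnicode_CheckIdentifier (which does not exist), describe_error and the message texts are invented for illustration, and the first offending character is found by testing successive prefixes with str.isidentifier().

```python
import unicodedata


def check_identifier(text: str) -> int:
    """Hypothetical: return 1 for a valid identifier, 0 for an empty string,
    or -n where n is the code point of the first offending character."""
    if text.isidentifier():
        return 1
    for i, ch in enumerate(text):
        # The first prefix that stops being an identifier ends in the culprit.
        if not text[: i + 1].isidentifier():
            return -ord(ch)
    return 0  # only reached for the empty string


def describe_error(text: str) -> str:
    """Pick an error message based on the category of the offending character."""
    result = check_identifier(text)
    if result > 0:
        return "ok"
    if result == 0:
        return "empty identifier"
    ch = chr(-result)
    # General categories Zs, Zl and Zp all start with "Z".
    if unicodedata.category(ch).startswith("Z"):
        return f"invalid whitespace character U+{ord(ch):04X}"
    return f"invalid character in identifier U+{ord(ch):04X}"
```

Here a no-break space after "x" would be reported as invalid whitespace, while something like a euro sign would still get the identifier message.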
History
Date User Action Args
2016-01-19 18:53:52 abarnert set recipients: + abarnert, vstinner, ezio.melotti, Drekin
2016-01-19 18:53:52 abarnert set messageid: <1453229632.22.0.311638428984.issue26152@psf.upfronthosting.co.za>
2016-01-19 18:53:52 abarnert link issue26152 messages
2016-01-19 18:53:51 abarnert create