Message 258616 - Python tracker


Author	abarnert
Recipients	Drekin, abarnert, ezio.melotti, vstinner
Date	2016-01-19.18:53:51
Message-id	<1453229632.22.0.311638428984.issue26152@psf.upfronthosting.co.za>

Content
Ultimately, this is because the tokenizer works, as far as possible, byte by byte instead of character by character. Since any byte >= 128 must be part of some non-ASCII character, and the only legal use for non-ASCII characters outside of quotes and comments is as part of an identifier, the tokenizer assumes that any byte >= 128 is part of an identifier (see the macros at the top of tokenizer.c, and the top of the again block in tok_get), and then checks the whole string with PyUnicode_IsIdentifier at the end.
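
To make the byte-by-byte behaviour concrete, here is a rough Python sketch of the idea, not the actual C code. The function name scan_identifier_candidate is made up for illustration, and it simplifies the real macros (for instance, it does not distinguish identifier-start bytes from identifier-continue bytes). It assumes the source is valid UTF-8.

```python
def scan_identifier_candidate(source, start):
    """Hypothetical sketch: greedily consume bytes that *might* belong to an
    identifier, treating every byte >= 128 as a potential identifier byte.

    Returns (candidate_text, end_index). Validation happens afterwards.
    """
    def is_potential_identifier_byte(b):
        return (
            ord("a") <= b <= ord("z")
            or ord("A") <= b <= ord("Z")
            or ord("0") <= b <= ord("9")
            or b == ord("_")
            or b >= 128  # any non-ASCII byte is assumed to be identifier material
        )

    end = start
    while end < len(source) and is_potential_identifier_byte(source[end]):
        end += 1
    # The scan stops only at an ASCII byte or at the end of input, so a
    # multi-byte UTF-8 sequence is never split by this slice.
    return source[start:end].decode("utf-8"), end


# A no-break space (bytes C2 A0) right after "x" gets swallowed into the
# candidate, and only the final whole-string check rejects it.
candidate, end = scan_identifier_candidate("x\u00a0= 1".encode("utf-8"), 0)
is_valid = candidate.isidentifier()
```

Here str.isidentifier stands in for the final PyUnicode_IsIdentifier check: the candidate is "x" plus the no-break space, so the error is reported as a bad identifier rather than as stray whitespace.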

This actually gives a better error for the more visible glyphs, especially ones that look letter-like but aren't in XID_Continue, but it is kind of weird for a few invisible ones, like the no-break space (U+00A0).
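
The difference between the two kinds of offending character can be seen from Python itself. This is only an illustration using the standard unicodedata module; the helper name describe is made up here.

```python
import unicodedata


def describe(ch):
    """Return (general category, whether "a" + ch is a valid identifier)."""
    return unicodedata.category(ch), ("a" + ch).isidentifier()


# U+2202 PARTIAL DIFFERENTIAL looks letter-like but is a math symbol (Sm).
partial = describe("\u2202")
# U+00A0 NO-BREAK SPACE is a space separator (Zs) and is invisible.
nbsp = describe("\u00a0")
# U+00EF (i with diaeresis) is a lowercase letter (Ll), fine in identifiers.
letter = describe("\u00ef")
```

Both U+2202 and U+00A0 are rejected as identifier characters, but only the second is in a Z category, which is what makes "invalid character in identifier" a confusing message for it.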

If this needs to be fixed, I think the simplest thing is to special-case things: if the first non-valid-identifier character is in one of the Z (separator) categories, set an error about invalid whitespace instead of invalid identifier character. (This would probably require adding a PyUnicode_CheckIdentifier that, instead of just returning 0 for failure as PyUnicode_IsIdentifier does, returns -n when the first non-identifier character has code point n.)
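
A rough Python sketch of that proposal, to show the shape of it. Neither function exists anywhere: check_identifier is a stand-in for the suggested PyUnicode_CheckIdentifier, and the return value for an empty string and the error wording are assumptions made up for this example.

```python
import unicodedata


def check_identifier(text):
    """Hypothetical analogue of the proposed PyUnicode_CheckIdentifier.

    Returns 1 for a valid identifier, 0 for an empty string (an assumption),
    and -n when the first non-identifier character has code point n.
    """
    if not text:
        return 0
    for i, ch in enumerate(text):
        # Every prefix of a valid identifier is itself valid, so the first
        # failing prefix ends at the first offending character.
        if not text[: i + 1].isidentifier():
            return -ord(ch)
    return 1


def identifier_error(text):
    """Pick an error message based on the first offending character."""
    result = check_identifier(text)
    if result > 0:
        return None
    if result == 0:
        return "empty identifier"
    code_point = -result
    if unicodedata.category(chr(code_point)).startswith("Z"):
        # Zs, Zl, Zp: report it as whitespace rather than as an identifier.
        return "invalid whitespace character U+%04X" % code_point
    return "invalid character in identifier U+%04X" % code_point


nbsp_message = identifier_error("x\u00a0")
symbol_message = identifier_error("a\u2202")
```

With this, a no-break space produces the whitespace message while a symbol such as U+2202 still gets the identifier message. The prefix loop is quadratic and is only meant to keep the sketch short; a C version would walk the code points once.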

History
Date	User	Action	Args
2016-01-19 18:53:52	abarnert	set	recipients: + abarnert, vstinner, ezio.melotti, Drekin
2016-01-19 18:53:52	abarnert	set	messageid: <1453229632.22.0.311638428984.issue26152@psf.upfronthosting.co.za>
2016-01-19 18:53:52	abarnert	link	issue26152 messages
2016-01-19 18:53:51	abarnert	create