Message 270933 - Python tracker

Author	ncoghlan
Recipients	Rosuav, berker.peksag, ncoghlan
Date	2016-07-21.15:06:14
Message-id	<1469113574.75.0.991325182412.issue27582@psf.upfronthosting.co.za>

Content
Looking at issue 2382, I agree that's a different problem (I'm seeing the current misbehaviour even though everything is consistently encoded as UTF-8).

The main case we're interested in here is the PyUnicode_IsIdentifier one, so if we wanted to do better than "start or end of the token", we could introduce a new internal "_PyUnicode_FindNonIdentifier" that reported the position of the first non-identifier character (or -1 if it's a valid identifier).
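
To make the proposed semantics concrete, here is a rough Python-level sketch of what such a helper might report. The name find_non_identifier is hypothetical (no such function exists in CPython), returning 0 for an empty string is an assumption of this sketch, and a real implementation would be written in C alongside PyUnicode_IsIdentifier. It relies only on str.isidentifier(): a single character is a valid identifier exactly when it can start one, and prefixing a known-good start character tests whether a character can continue one.

```python
def find_non_identifier(s: str) -> int:
    """Return the index of the first character that stops ``s`` from being
    a valid identifier, or -1 if ``s`` is a valid identifier.

    Hypothetical helper for illustration only; not part of CPython.
    """
    if not s:
        # Assumption: an empty string is not an identifier, and the
        # "offending" position is reported as 0.
        return 0
    for i, ch in enumerate(s):
        if i == 0:
            # A one-character string is an identifier only if the
            # character is allowed at the start of an identifier.
            ok = ch.isidentifier()
        else:
            # Prefix a known-valid start character so that only the
            # "may continue an identifier" check applies to ``ch``.
            ok = ("a" + ch).isidentifier()
        if not ok:
            return i
    return -1
```

With a result like this, the error indicator could point at the offending character itself (index 2 for "sp€m", for example) rather than at the start or end of the token.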

Unfortunately, I'm not at all familiar with parsetok.c myself (my own work with the code generator has been from the AST on), so I don't have a ready answer for your other questions.

History
Date	User	Action	Args
2016-07-21 15:06:14	ncoghlan	set	recipients: + ncoghlan, Rosuav, berker.peksag
2016-07-21 15:06:14	ncoghlan	set	messageid: <1469113574.75.0.991325182412.issue27582@psf.upfronthosting.co.za>
2016-07-21 15:06:14	ncoghlan	link	issue27582 messages
2016-07-21 15:06:14	ncoghlan	create