Message 221066 - Python tracker


Author	ezio.melotti
Recipients	ezio.melotti, lemburg, loewis, taleinat, terry.reedy
Date	2014-06-20.04:08:36
Message-id	<1403237317.37.0.86990207668.issue21765@psf.upfronthosting.co.za>
In-reply-to

Content

> I'm not sure what the "Other_ID_Start property" mentioned in [1] and
> [2] means, though. Can we get someone with more in-depth knowledge of
> unicode to help with this? 

See http://www.unicode.org/reports/tr31/#Backward_Compatibility.
Basically, these characters were valid ID_Start characters in earlier versions of Unicode but no longer qualify on their own; the property exists to keep them valid for backward compatibility.  I think it's safe to leave them out (perhaps they could/should be removed from the Python parser too), but if you want to add them, the list includes only 4 characters (there are 12 more for Other_ID_Continue).

> The real question is how to do this *fast*, since HyperParser does a
> *lot* of these checks. Do you think caching would be a good approach?

I think it would be enough to check explicitly for ASCII chars first, since most of them will be ASCII anyway.  If a char is not ASCII, you can fall back to unicodedata.category (or str.isidentifier(), if it does the right thing).
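
A rough sketch of that ASCII fast path, in case it helps.  The names is_id_start, is_id_continue and the two frozensets are made up for illustration and are not HyperParser's actual code.  It assumes str.isidentifier() is acceptable for the non-ASCII case; unicodedata.category would be the alternative.

```python
import string

# Hypothetical helpers, not IDLE's actual HyperParser code.
# Precomputed ASCII sets make the common case a single set lookup.
_ASCII_ID_START = frozenset(string.ascii_letters + "_")
_ASCII_ID_CONTINUE = _ASCII_ID_START | frozenset(string.digits)


def is_id_start(ch):
    """Return True if the single character ch can start an identifier."""
    if ch < "\x80":
        # Fast path: ASCII characters never reach the Unicode database.
        return ch in _ASCII_ID_START
    # Slow path (assumption: str.isidentifier() does the right thing here).
    return ch.isidentifier()


def is_id_continue(ch):
    """Return True if the single character ch can continue an identifier."""
    if ch < "\x80":
        return ch in _ASCII_ID_CONTINUE
    # Prefixing a known-valid start character turns the "continue"
    # question into a whole-identifier check.
    return ("a" + ch).isidentifier()
```

Whether this is fast enough without extra caching would still need to be measured against real input.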

History
Date	User	Action	Args
2014-06-20 04:08:37	ezio.melotti	set	recipients: + ezio.melotti, lemburg, loewis, terry.reedy, taleinat
2014-06-20 04:08:37	ezio.melotti	set	messageid: <1403237317.37.0.86990207668.issue21765@psf.upfronthosting.co.za>
2014-06-20 04:08:37	ezio.melotti	link	issue21765 messages
2014-06-20 04:08:37	ezio.melotti	create