Author ezio.melotti
Recipients ezio.melotti, lemburg, loewis, taleinat, terry.reedy
Date 2014-06-20.04:08:36
Message-id <1403237317.37.0.86990207668.issue21765@psf.upfronthosting.co.za>
In-reply-to
Content
> I'm not sure what the "Other_ID_Start property" mentioned in [1] and
> [2] means, though. Can we get someone with more in-depth knowledge of
> unicode to help with this? 

See http://www.unicode.org/reports/tr31/#Backward_Compatibility.
Basically, these characters were considered valid ID_Start characters in previous versions of Unicode but no longer are. I think it's safe to leave them out (perhaps they could or should be removed from the Python parser too). If you do want to add them, the list contains only 4 characters (plus 12 more for Other_ID_Continue).
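
For illustration, a minimal sketch of a category-based check that optionally adds those 4 characters back. The helper names are hypothetical. The category set follows the ID_Start definition in the Python language reference, and the code points are the 4 Other_ID_Start entries as I understand PropList.txt for the Unicode versions current at the time of this message (later versions may list more):

```python
import unicodedata

# General categories that make up ID_Start (letters and letter numbers).
_ID_START_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"})

# Assumption: the 4 Other_ID_Start code points kept for backward
# compatibility (U+2118, U+212E, U+309B, U+309C).
_OTHER_ID_START = frozenset("\u2118\u212e\u309b\u309c")


def is_id_start_by_category(ch):
    """Return True if ch is '_' or its general category is in ID_Start."""
    return ch == "_" or unicodedata.category(ch) in _ID_START_CATEGORIES


def is_id_start(ch, include_other=False):
    """Category check, optionally accepting the Other_ID_Start characters."""
    if include_other and ch in _OTHER_ID_START:
        return True
    return is_id_start_by_category(ch)
```

With include_other left at False, the 4 characters are rejected, because none of them has a letter category any more.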

> The real question is how to do this *fast*, since HyperParser does a
> *lot* of these checks. Do you think caching would be a good approach?

I think it would be enough to check explicitly for ASCII characters, since most of the input will be ASCII anyway. For non-ASCII characters you can fall back to unicodedata.category (or str.isidentifier(), if it does the right thing).
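
Something along these lines is what I have in mind. This is only a sketch: the function names are made up and it is not taken from HyperParser. ASCII characters are answered from a precomputed set, and everything else falls back to str.isidentifier(), assuming that method matches what the parser accepts:

```python
import string

# Precomputed ASCII sets: membership tests here avoid any Unicode lookup.
_ASCII_ID_START = frozenset(string.ascii_letters + "_")
_ASCII_ID_CONTINUE = _ASCII_ID_START | frozenset(string.digits)


def is_id_start(ch):
    """Return True if the single character ch can start an identifier."""
    if ch < "\x80":  # fast path for ASCII
        return ch in _ASCII_ID_START
    return ch.isidentifier()  # slow path for non-ASCII


def is_id_continue(ch):
    """Return True if the single character ch can continue an identifier."""
    if ch < "\x80":  # fast path for ASCII
        return ch in _ASCII_ID_CONTINUE
    # Prefix a known-valid start character so that only ch's
    # "continue" status decides the result.
    return ("a" + ch).isidentifier()
```

If the slow path still shows up in profiles, wrapping it in a small cache would be an easy follow-up, but the ASCII check alone should cover most calls.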
History
Date User Action Args
2014-06-20 04:08:37 ezio.melotti set recipients: + ezio.melotti, lemburg, loewis, terry.reedy, taleinat
2014-06-20 04:08:37 ezio.melotti set messageid: <1403237317.37.0.86990207668.issue21765@psf.upfronthosting.co.za>
2014-06-20 04:08:37 ezio.melotti link issue21765 messages
2014-06-20 04:08:37 ezio.melotti create