
Author loewis
Recipients ezio.melotti, lemburg, loewis, taleinat, terry.reedy
Date 2014-06-22.08:53:56
Message-id <1403427237.29.0.0173457824687.issue21765@psf.upfronthosting.co.za>
Content
The reason the Unicode Consortium made this list (Other_ID_Start) is that they want to promise 100% backwards compatibility: if some programming language had been using UAX#31, changes to the Unicode database might otherwise break existing code. To avoid this, UAX#31 guarantees 100% stability.

Python uses it because it follows UAX#31, with the minimum number of modifications. We really shouldn't be making arbitrary changes to it. If we were to say, e.g., that we drop these four characters now, the next Unicode version might add more characters to Other_ID_Start, and then we would have to say that we include some, but not all, characters from Other_ID_Start.
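To see what Other_ID_Start contributes in practice, here is a small sketch that prints the general category of each character and what str.isidentifier() says about it. The list of four code points is my assumption about the contents of Other_ID_Start at the time of this message (later Unicode versions may list more), and the output depends on the Unicode database bundled with the interpreter.

```python
import unicodedata

# Assumption: these are the four Other_ID_Start characters under discussion;
# newer Unicode versions may add further entries to the property.
OTHER_ID_START = ["\u2118", "\u212e", "\u309b", "\u309c"]

# None of these has a letter category, which is why they need a separate list.
categories = {ch: unicodedata.category(ch) for ch in OTHER_ID_START}

for ch in OTHER_ID_START:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), categories[ch], ch.isidentifier())
```

Note that str.isidentifier() tests the XID properties, not ID_Start, so its answer for a given character need not match membership in Other_ID_Start.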

So if IDLE wants to reimplement the XID_Start and XID_Continue properties, it should do it correctly. Note that the proposed patch only manages to replicate the ID_Start and ID_Continue properties. For the XID versions, see

http://www.unicode.org/reports/tr31/#NFKC_Modifications

Unfortunately, the specification doesn't explain exactly how these modifications are performed. For item 1, I think it is:

Characters which are in ID_Start (because they count as letters) but whose NFKC decomposition does not start with an ID_Start character (because it starts with a modifier instead) are removed in XID_Start.
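A rough sketch of that guess for item 1, not the consortium's actual derivation: approx_id_start and dropped_by_item_1 are hypothetical helper names, and ID_Start is approximated by general category alone (Other_ID_Start and the pattern exclusions are ignored). U+0E33 THAI CHARACTER SARA AM is used as the test case.

```python
import unicodedata

# Assumption: ID_Start is approximated by these general categories only.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

def approx_id_start(ch):
    """Approximate ID_Start membership by general category (hypothetical helper)."""
    return unicodedata.category(ch) in ID_START_CATEGORIES

def dropped_by_item_1(ch):
    """True if ch looks like ID_Start but its NFKC form starts with something else."""
    if not approx_id_start(ch):
        return False
    nfkc = unicodedata.normalize("NFKC", ch)
    return bool(nfkc) and not approx_id_start(nfkc[0])

sara_am = "\u0e33"  # category Lo; its NFKC form begins with U+0E4D, a combining mark
print(dropped_by_item_1(sara_am), sara_am.isidentifier(), ("a" + sara_am).isidentifier())
```

If the guess is right, such a character is rejected as the first character of an identifier but still accepted in a later position, which is what the last line checks.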

For item 2, they unfortunately don't list all characters that get excluded. For the one example that they do give, the reason is clear: U+037A (GREEK YPOGEGRAMMENI, category Lm) decomposes to U+0020 (SPACE) U+0345 (COMBINING GREEK YPOGEGRAMMENI). Having a space in an identifier is clearly out of the question. I assume similar problems occur with "certain Arabic presentation forms". I wish the consortium were more explicit about the precise algorithms they use to derive their derived properties.
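The U+037A case can be checked with the unicodedata module. This is only an illustration of the decomposition described above; the variable names are mine.

```python
import unicodedata

ch = "\u037a"  # GREEK YPOGEGRAMMENI

category = unicodedata.category(ch)
decomposition = unicodedata.decomposition(ch)  # raw compatibility mapping
nfkc = unicodedata.normalize("NFKC", ch)
nfkc_code_points = [f"U+{ord(c):04X}" for c in nfkc]
is_identifier = ch.isidentifier()

print(category, decomposition, nfkc_code_points, is_identifier)
```

The NFKC form begins with a space, so the character cannot survive normalization inside an identifier, and str.isidentifier() rejects it.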
History
Date                 User    Action  Args
2014-06-22 08:53:57  loewis  set     recipients: + loewis, lemburg, terry.reedy, taleinat, ezio.melotti
2014-06-22 08:53:57  loewis  set     messageid: <1403427237.29.0.0173457824687.issue21765@psf.upfronthosting.co.za>
2014-06-22 08:53:57  loewis  link    issue21765 messages
2014-06-22 08:53:56  loewis  create