
Author Elizacat
Recipients Elizacat, PanderMusubi, akitada, benjamin.peterson, doerwalter, ezio.melotti, lemburg, loewis, pitrou, vstinner
Date 2014-09-02.08:22:50
Message-id <1409646171.59.0.246904085475.issue6331@psf.upfronthosting.co.za>
In-reply-to
Content
> I think this needs to be fixed, then - we need to study why there are
> so many new records (e.g. what script contributes most new records),
> and then look for alternatives.

The "Common" script appears to be very fragmented and may be the cause of the issues.

> One alternative could be to create a separate Trie for scripts.

I haven't seen the one in C yet, but I have one written in Python, custom-made for storing the script database and based on the general idea of a range tree. It stores each range individually, straight out of Scripts.txt. Each node is keyed on the average of the lower and upper bounds of its range (the two bounds can be equal). When searching, you compare the codepoint value to the average at the current node and use that to decide which direction to continue down the tree.
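A minimal sketch of the kind of lookup described above, for illustration only. This is not the code offered in this message: the names `Node`, `build` and `lookup` are hypothetical, the sample ranges are a small hand-picked subset rather than a parse of Scripts.txt, and the sketch assumes the ranges never overlap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    lo: int                      # first codepoint of the range
    hi: int                      # last codepoint of the range (may equal lo)
    script: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def mid(self) -> float:
        # Average of the bounds; used to pick a search direction.
        return (self.lo + self.hi) / 2


def build(ranges):
    """Build a balanced tree from disjoint (lo, hi, script) tuples."""
    ranges = sorted(ranges)

    def _build(a, b):
        if a >= b:
            return None
        m = (a + b) // 2
        lo, hi, script = ranges[m]
        return Node(lo, hi, script, _build(a, m), _build(m + 1, b))

    return _build(0, len(ranges))


def lookup(node, cp, default="Unknown"):
    """Return the script of codepoint cp, or default if no range holds it."""
    while node is not None:
        if node.lo <= cp <= node.hi:
            return node.script
        # Outside this range: go left if below the average, right otherwise.
        node = node.left if cp < node.mid else node.right
    return default


# Illustrative sample data only (assumption: not the full Scripts.txt).
SAMPLE = [
    (0x0030, 0x0039, "Common"),
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x0370, 0x0373, "Greek"),
    (0x0400, 0x0484, "Cyrillic"),
]

tree = build(SAMPLE)
print(lookup(tree, ord("A")))
print(lookup(tree, 0x0416))
print(lookup(tree, 0x10FFFF))
```

Because the ranges are disjoint, a codepoint that falls outside a node's range is always on the same side of that node's average as it is of the range itself, so the comparison against the average is enough to choose a subtree.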

Without coalescing neighbouring ranges that have the same script, I have 1,606 nodes in the tree (for Unicode 7.0, which added a lot of scripts). After coalescing, there appear to be 806 nodes.
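The coalescing step mentioned above could look roughly like the following. Again this is only a sketch under assumptions: `coalesce` is a hypothetical helper, the input is taken to be disjoint (lo, hi, script) tuples, and the sample rows are illustrative rather than read from Scripts.txt.

```python
def coalesce(ranges):
    """Merge adjacent ranges that carry the same script.

    Two ranges are merged only if the second starts exactly one
    codepoint after the first ends and both have the same script.
    """
    merged = []
    for lo, hi, script in sorted(ranges):
        if merged and merged[-1][2] == script and merged[-1][1] + 1 == lo:
            # Extend the previous range instead of adding a new one.
            merged[-1] = (merged[-1][0], hi, script)
        else:
            merged.append((lo, hi, script))
    return merged


# Illustrative rows only (assumption: not the full Scripts.txt).
rows = [
    (0x0400, 0x0481, "Cyrillic"),
    (0x0482, 0x0482, "Cyrillic"),
    (0x0483, 0x0484, "Cyrillic"),
    (0x0485, 0x0486, "Inherited"),
]
print(coalesce(rows))
```

Scripts.txt can list several consecutive rows for one script, which is why merging them can shrink the node count considerably.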

If anyone cares, I'll be more than happy to post code for inspection.

> I don't know what this will be used for, but one application is
> certainly regular expressions. So we need an efficient test whether
> the character is in the expected script or not. It would be bad if
> such a test would have to do a .lower() on each lookup.

This is actually required for restriction-level detection as described in Unicode TR39, for all levels of restriction above ASCII-only (http://www.unicode.org/reports/tr39/#Restriction_Level_Detection).
History
Date User Action Args
2014-09-02 08:22:51  Elizacat  set  recipients: + Elizacat, lemburg, loewis, doerwalter, pitrou, vstinner, benjamin.peterson, ezio.melotti, akitada, PanderMusubi
2014-09-02 08:22:51  Elizacat  set  messageid: <1409646171.59.0.246904085475.issue6331@psf.upfronthosting.co.za>
2014-09-02 08:22:51  Elizacat  link  issue6331 messages
2014-09-02 08:22:50  Elizacat  create