Issue6331
Created on 2009-06-23 20:50 by doerwalter, last changed 2009-07-24 09:44 by ezio.melotti.
| Files | ||
|---|---|---|
| File name | Uploaded | Description |
| unicode-script.diff | doerwalter, 2009-06-23 20:50 | |
| unicode-script-2.diff | doerwalter, 2009-06-24 18:56 | |
| unicode-script-3.diff | doerwalter, 2009-07-01 10:55 | |
| Messages (6) | |||
|---|---|---|---|
| msg89642 - (view) | Author: Walter Dörwald (doerwalter) | Date: 2009-06-23 20:50 | |
This patch adds a function unicodedata.script() that returns the script of a given Unicode character. |
|||
| msg89647 - (view) | Author: Martin v. Löwis (loewis) | Date: 2009-06-23 21:35 | |
I think the patch is incorrect: the default value for the script property ought to be Unknown, not Common (despite UCD.html saying the contrary; see UTR#24 and Scripts.txt).
I'm puzzled why you use a hard-coded list of script names. The set of scripts will certainly change across Unicode versions, and I think it would be better to learn the script names from Scripts.txt.
Out of curiosity: how does the addition of the script property affect the number of distinct database records, and the total size of the database?
I think a common application would be lower-cased script names, for more efficient comparison; UCD has also changed the spelling of the script names over time (they used to be all-capital). So I propose that
a) two functions are provided: one with the original script names, and one with the lower-case script names
b) cached versions of interned script name strings are kept in separate arrays, to avoid PyString_FromString every time.
I'm doubtful that script names need to be provided for old database versions, so I would be happy to not record the script for old versions, and raise an exception if somebody tries to get the script for an old database version - surely applications of the old database records won't be accessing the script property, anyway. |
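As a rough illustration of the two suggestions above (learning script names from the data file rather than hard-coding them, and defaulting to Unknown), here is a minimal Python sketch. It is not the patch's code: `parse_scripts`, `script_of`, and the `SAMPLE` excerpt are hypothetical names and data introduced only for this example, and the sample merely imitates the `range ; Script # comment` layout of Scripts.txt with a handful of lines.

```python
import bisect

# Illustrative excerpt in the Scripts.txt layout (NOT the real file;
# the trailing comments are made up for this example).
SAMPLE = """\
# code point or range ; script name # free-form comment
0020          ; Common # space
0041..005A    ; Latin  # A-Z
0061..007A    ; Latin  # a-z
03B1..03C9    ; Greek  # small alpha..omega
"""

def parse_scripts(text):
    """Return a sorted list of (first, last, script) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        codes, script = [part.strip() for part in line.split(";")]
        first, _, last = codes.partition("..")
        # A single code point has no "..", so last falls back to first.
        ranges.append((int(first, 16), int(last or first, 16), script))
    ranges.sort()
    return ranges

def script_of(ranges, char, default="Unknown"):
    """Look up the script of `char`; unlisted code points get `default`."""
    cp = ord(char)
    starts = [r[0] for r in ranges]
    i = bisect.bisect_right(starts, cp) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return default

ranges = parse_scripts(SAMPLE)
print(script_of(ranges, "A"), script_of(ranges, "\u0378"))
```

Because the script names come straight from the parsed lines, a newer data file with additional scripts would need no code change. With this tiny sample, anything not listed (including assigned characters such as digits) falls through to the default.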
|||
| msg89671 - (view) | Author: Walter Dörwald (doerwalter) | Date: 2009-06-24 18:55 | |
Martin v. Löwis wrote:
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> I think the patch is incorrect: the default value for the script
> property ought to be Unknown, not Common (despite UCD.html saying the
> contrary; see UTR#24 and Scripts.txt).
Fixed.
> I'm puzzled why you use a hard-coded list of script names. The set of
> scripts will certainly change across Unicode versions, and I think it
> would be better to learn the script names from Scripts.txt.
I hardcoded the list, because I saw no easy way to get the indexes consistent across both versions of the database.
> Out of curiosity: how does the addition of the script property affect
> the number of distinct database records, and the total size of the database?
I'm not exactly sure how to measure this, but the length of _PyUnicode_Database_Records goes from 229 entries to 690 entries. If it's any help I can post the output of makeunicodedata.py.
> I think a common application would be lower-cases script names, for more
> efficient comparison; UCD has also changed the spelling of the script
> names over time (from being all-capital before). So I propose that
> a) two functions are provided: one with the original script names, and
> one with the lower-case script names
Is this really necessary, if we only have one version of the database?
> b) keep cached versions of interned script name strings in separate
> arrays, to avoid PyString_FromString every time.
Implemented.
> I'm doubtful that script names need to be provided for old database
> versions, so I would be happy to not record the script for old versions,
> and raise an exception if somebody tries to get the script for an old
> database version - surely applications of the old database records won't
> be accessing the script property, anyway.
OK, I've removed the script_changes info for the old database. (And with this change the list of script names is no longer hardcoded.)
Here's a new version of the patch (unicode-script-2.diff). |
|||
| msg89675 - (view) | Author: Martin v. Löwis (loewis) | Date: 2009-06-24 19:31 | |
>> I'm puzzled why you use a hard-coded list of script names. The set of
>> scripts will certainly change across Unicode versions, and I think it
>> would be better to learn the script names from Scripts.txt.
>
> I hardcoded the list, because I saw no easy way to get the indexes
> consistent across both versions of the database.
Couldn't you have a global cache, something like
    scripts = ['Unknown']
    def findscript(script):
        try:
            return scripts.index(script)
        except ValueError:
            scripts.append(script)
            return len(scripts)-1
>> Out of curiosity: how does the addition of the script property affect
>> the number of distinct database records, and the total size of the database?
>
> I'm not exactly sure how to measure this, but the length of
> _PyUnicode_Database_Records goes from 229 entries to 690 entries.
I think this needs to be fixed, then - we need to study why there are
so many new records (e.g. what script contributes most new records),
and then look for alternatives.
One alternative could be to create a separate Trie for scripts.
I'd also be curious if we can increase the homogeneity of scripts
(i.e. produce longer runs of equal scripts) if we declare that
unassigned code points have the script that corresponds to the block
(i.e. the script that surrounding characters have), and then only
change it to "Unknown" at lookup time if it's unassigned.
> If it's any help I can post the output of makeunicodedata.py.
I'd be interested in "size unicodedata.so", and how it changes.
Perhaps the actual size increase isn't that bad.
>> a) two functions are provided: one with the original script names, and
>> one with the lower-case script names
>
> Is this really necessary, if we only have one version of the database?
I don't know what this will be used for, but one application is
certainly regular expressions. So we need an efficient test whether
the character is in the expected script or not. It would be bad if
such a test would have to do a .lower() on each lookup.
|
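To make the two ideas in this message concrete, the following Python sketch combines the global index cache with a parallel list of lower-cased names, so that a lookup never has to call `.lower()`. This is only an assumption about how the pieces could fit together, not the patch's C implementation: `scripts_lower`, `records`, `script_name`, and `script_name_lower` are hypothetical names introduced for the example, and `records` stands in for the real per-character database.

```python
scripts = ["Unknown"]        # index 0 is reserved for the default script
scripts_lower = ["unknown"]  # parallel list, lower-cased once at build time

def findscript(script):
    """Return the index of `script`, registering it on first sight."""
    try:
        return scripts.index(script)
    except ValueError:
        scripts.append(script)
        scripts_lower.append(script.lower())
        return len(scripts) - 1

# Indexes are assigned in order of first appearance and never change,
# so they stay consistent no matter how many data files are processed.
latin = findscript("Latin")
greek = findscript("Greek")

# Hypothetical per-character records: code point -> script index.
records = {ord("a"): latin, ord("\u03b1"): greek}

def script_name(char):
    """Script name in its original spelling."""
    return scripts[records.get(ord(char), 0)]

def script_name_lower(char):
    """Script name in lower case; no .lower() call at lookup time."""
    return scripts_lower[records.get(ord(char), 0)]

print(script_name("a"), script_name_lower("\u03b1"), script_name("?"))
```

Both lookups are a plain index into a precomputed list, which is the property a regular-expression engine would want when testing many characters against an expected script.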
|||
| msg89701 - (view) | Author: Walter Dörwald (doerwalter) | Date: 2009-06-25 09:14 | |
I was comparing apples and oranges: the 229 entries for the trunk were for a UCS2 build (the patched version was UCS4); with UCS4 there are 317 entries for the trunk.
size unicodedata.o gives:
__TEXT  __DATA  __OBJC  others  dec     hex
13622   587057  0       23811   624490  9876a
for trunk and
__TEXT  __DATA  __OBJC  others  dec     hex
17769   588817  0       24454   631040  9a100
for the patched version. |
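For readers who want the difference spelled out, the short Python calculation below works only from the two `size` outputs quoted in this message; the variable names are made up for the example. It suggests the patched object file is about 1% larger, with most of the growth in `__TEXT` rather than `__DATA`.

```python
# Segment sizes in bytes, copied from the `size unicodedata.o` output above.
trunk = {"__TEXT": 13622, "__DATA": 587057, "__OBJC": 0, "others": 23811}
patched = {"__TEXT": 17769, "__DATA": 588817, "__OBJC": 0, "others": 24454}

trunk_total = sum(trunk.values())      # matches the "dec" column for trunk
patched_total = sum(patched.values())  # matches the "dec" column for the patch

growth = patched_total - trunk_total
percent = 100 * growth / trunk_total
per_segment = {name: patched[name] - trunk[name] for name in trunk}

print(f"total growth: {growth} bytes ({percent:.2f}%)")
for name, delta in per_segment.items():
    print(f"{name}: +{delta}")
```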
|||
| msg89973 - (view) | Author: Walter Dörwald (doerwalter) | Date: 2009-07-01 10:54 | |
Here is a new version that includes a new function scriptl() that returns the script name in lowercase. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2009-07-24 09:44:35 | ezio.melotti | set | keywords: + needs review stage: patch review |
| 2009-07-01 10:55:20 | doerwalter | set | files: + unicode-script-3.diff messages: + msg89973 |
| 2009-06-25 09:14:17 | doerwalter | set | messages: + msg89701 |
| 2009-06-24 19:31:08 | loewis | set | messages: + msg89675 |
| 2009-06-24 18:56:52 | doerwalter | set | files: + unicode-script-2.diff |
| 2009-06-24 18:56:05 | doerwalter | set | messages: + msg89671 |
| 2009-06-24 06:36:17 | ezio.melotti | set | priority: normal nosy: + ezio.melotti |
| 2009-06-23 22:02:04 | akitada | set | nosy: + akitada |
| 2009-06-23 21:36:00 | loewis | set | nosy: + loewis messages: + msg89647 |
| 2009-06-23 20:50:57 | doerwalter | create | |