
Author doerwalter
Recipients akitada, doerwalter, ezio.melotti, loewis
Date 2009-06-24.18:55:59
Message-id <4A4276BA.2070702@livinglogic.de>
In-reply-to <1245792961.5.0.460725375108.issue6331@psf.upfronthosting.co.za>
Content
Martin v. Löwis wrote:
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
> I think the patch is incorrect: the default value for the script
> property ought to be Unknown, not Common (despite UCD.html saying the
> contrary; see UTR#24 and Scripts.txt).

Fixed.
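
To illustrate the default discussed above, here is a minimal sketch (not code from the patch) of reading Scripts.txt-style lines and falling back to Unknown for code points the file does not list. The names parse_scripts, script_of and SAMPLE are hypothetical, and SAMPLE is a tiny hand-written excerpt in the Scripts.txt layout, not the real file.

```python
def parse_scripts(lines):
    """Map code points to script names from Scripts.txt-style lines."""
    scripts = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        code_range, name = (field.strip() for field in data.split(";"))
        first, _, last = code_range.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        for cp in range(start, end + 1):
            scripts[cp] = name
    return scripts


def script_of(scripts, char, default="Unknown"):
    """Return the script of char; unlisted code points get the default."""
    return scripts.get(ord(char), default)


# Hand-written sample in the Scripts.txt layout (assumption, not real data).
SAMPLE = """\
# sample data
0020          ; Common # Zs       SPACE
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0391..03A1    ; Greek # L&  [17] GREEK CAPITAL LETTER ALPHA..GREEK CAPITAL LETTER RHO
"""

scripts = parse_scripts(SAMPLE.splitlines())
```

With this sample, script_of(scripts, "A") gives "Latin", while a code point absent from the data, such as U+0378, gives "Unknown" rather than "Common".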

> I'm puzzled why you use a hard-coded list of script names. The set of
> scripts will certainly change across Unicode versions, and I think it
> would be better to learn the script names from Scripts.txt.

I hardcoded the list because I saw no easy way to keep the indexes
consistent across both versions of the database.

> Out of curiosity: how does the addition of the script property affect
> the number of distinct database records, and the total size of the database?

I'm not exactly sure how to measure this, but the length of
_PyUnicode_Database_Records goes from 229 entries to 690 entries.

If it's any help I can post the output of makeunicodedata.py.
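
One rough way to see why an extra property inflates the record count is to count distinct property tuples with and without it. The sketch below is only an approximation of the idea, not how makeunicodedata.py builds _PyUnicode_Database_Records: it uses the public unicodedata functions, restricts itself to the BMP to stay fast, and uses east_asian_width as a stand-in for the script property (which the standard unicodedata module does not expose). The helper name distinct_records is hypothetical.

```python
import unicodedata


def distinct_records(props, limit=0x10000):
    """Count distinct tuples of property values over code points < limit."""
    records = set()
    for cp in range(limit):
        ch = chr(cp)
        records.add(tuple(prop(ch) for prop in props))
    return len(records)


# A simplified "record": only a few of the properties the real table stores.
base = [
    unicodedata.category,
    unicodedata.bidirectional,
    unicodedata.combining,
    unicodedata.mirrored,
]
# Stand-in for adding one more per-character property (assumption).
extended = base + [unicodedata.east_asian_width]

base_count = distinct_records(base)
extended_count = distinct_records(extended)
print(base_count, extended_count)
```

The exact numbers depend on the Unicode version shipped with the interpreter, but the extended count can never be smaller than the base count, since every added field can only split existing records further.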

> I think a common application would be lower-cased script names, for more
> efficient comparison; UCD has also changed the spelling of the script
> names over time (from being all-capital before). So I propose that
> a) two functions are provided: one with the original script names, and
> one with the lower-case script names

Is this really necessary if we only have one version of the database?

> b) keep cached versions of interned script name strings in separate
> arrays, to avoid PyString_FromString every time.

Implemented.
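
As a Python analogy of the caching idea in (b), and not the C code from the patch, the name strings can be built and interned once in separate arrays indexed by script number, so that lookups return the same object every time instead of creating a new string. SCRIPT_NAMES is a hypothetical subset, and the lower flag only illustrates how a lower-case variant from (a) could share the same scheme.

```python
import sys

# Hypothetical subset of script names; the real list would come from the UCD.
SCRIPT_NAMES = ("Unknown", "Common", "Latin", "Greek")

# Built once at import time: one array per spelling, indexed by script number.
_NAMES = tuple(sys.intern(name) for name in SCRIPT_NAMES)
_LOWER_NAMES = tuple(sys.intern(name.lower()) for name in SCRIPT_NAMES)


def script_name(index, lower=False):
    """Return the cached name string for a script index."""
    table = _LOWER_NAMES if lower else _NAMES
    return table[index]
```

Because the strings are interned and cached, callers can compare results with `is` as well as `==`.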

> I'm doubtful that script names need to be provided for old database
> versions, so I would be happy to not record the script for old versions,
> and raise an exception if somebody tries to get the script for an old
> database version - surely applications of the old database records won't
> be accessing the script property, anyway.

OK, I've removed the script_changes info for the old database. (And with
this change the list of script names is no longer hardcoded).
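
A minimal sketch of learning the script names from the data instead of hardcoding them, under the assumption that indexes are handed out in order of first appearance and that index 0 is reserved for Unknown (the patch may number them differently). The function name script_index_table and the sample lines are hypothetical.

```python
def script_index_table(lines):
    """Collect script names from Scripts.txt-style lines and number them."""
    names = ["Unknown"]      # index 0 reserved for the default (assumption)
    index = {"Unknown": 0}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        name = data.split(";")[1].strip()
        if name not in index:
            index[name] = len(names)
            names.append(name)
    return names, index


# Hand-written sample in the Scripts.txt layout (not the real file).
sample_lines = [
    "0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "0020          ; Common # Zs       SPACE",
    "0391..03A1    ; Greek # L&  [17] GREEK CAPITAL LETTER ALPHA..GREEK CAPITAL LETTER RHO",
    "0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z",
]

names, index = script_index_table(sample_lines)
```

With a single database version there is only one such table, so the indexes no longer have to be kept consistent across versions.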

Here's a new version of the patch (unicode-script-2.diff).
History
Date                 User        Action  Args
2009-06-24 18:56:07  doerwalter  set     recipients: + doerwalter, loewis, ezio.melotti, akitada
2009-06-24 18:56:05  doerwalter  link    issue6331 messages
2009-06-24 18:55:59  doerwalter  create