Issue 6331: Add unicode script info to the unicode database


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/50580

classification

Title:	Add unicode script info to the unicode database
Type:	enhancement	Stage:	patch review
Components:	Unicode	Versions:	Python 3.7

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Cosimo Lupo, Denis Jacquerye, Elizacat, Greg Price, PanderMusubi, akitada, benjamin.peterson, berker.peksag, doerwalter, ezio.melotti, lemburg, loewis, vstinner
Priority:	normal	Keywords:	needs review, patch

Created on 2009-06-23 20:50 by doerwalter, last changed 2022-04-11 14:56 by admin.

Files
File name	Uploaded	Description	Edit
unicode-script.diff	doerwalter, 2009-06-23 20:50		review
unicode-script-2.diff	doerwalter, 2009-06-24 18:56		review
unicode-script-3.diff	doerwalter, 2009-07-01 10:55		review

Messages (17)
msg89642 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2009-06-23 20:50
This patch adds a function unicodedata.script() that returns information about the script of the Unicode character.
msg89647 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-06-23 21:35
I think the patch is incorrect: the default value for the script property ought to be Unknown, not Common (despite UCD.html saying the contrary; see UTR#24 and Scripts.txt).

I'm puzzled why you use a hard-coded list of script names. The set of scripts will certainly change across Unicode versions, and I think it would be better to learn the script names from Scripts.txt.

Out of curiosity: how does the addition of the script property affect the number of distinct database records, and the total size of the database?

I think a common application would be lower-cased script names, for more efficient comparison; UCD has also changed the spelling of the script names over time (from being all-capital before). So I propose that
a) two functions are provided: one with the original script names, and one with the lower-case script names
b) keep cached versions of interned script name strings in separate arrays, to avoid PyString_FromString every time.

I'm doubtful that script names need to be provided for old database versions, so I would be happy to not record the script for old versions, and raise an exception if somebody tries to get the script for an old database version - surely applications of the old database records won't be accessing the script property, anyway.
msg89671 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2009-06-24 18:55
Martin v. Löwis wrote:
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> I think the patch is incorrect: the default value for the script
> property ought to be Unknown, not Common (despite UCD.html saying the
> contrary; see UTR#24 and Scripts.txt).

Fixed.

> I'm puzzled why you use a hard-coded list of script names. The set of
> scripts will certainly change across Unicode versions, and I think it
> would be better to learn the script names from Scripts.txt.

I hardcoded the list, because I saw no easy way to get the indexes consistent across both versions of the database.

> Out of curiosity: how does the addition of the script property affect
> the number of distinct database records, and the total size of the database?

I'm not exactly sure how to measure this, but the length of _PyUnicode_Database_Records goes from 229 entries to 690 entries. If it's any help I can post the output of makeunicodedata.py.

> I think a common application would be lower-cased script names, for more
> efficient comparison; UCD has also changed the spelling of the script
> names over time (from being all-capital before). So I propose that
> a) two functions are provided: one with the original script names, and
> one with the lower-case script names

Is this really necessary, if we only have one version of the database?

> b) keep cached versions of interned script name strings in separate
> arrays, to avoid PyString_FromString every time.

Implemented.

> I'm doubtful that script names need to be provided for old database
> versions, so I would be happy to not record the script for old versions,
> and raise an exception if somebody tries to get the script for an old
> database version - surely applications of the old database records won't
> be accessing the script property, anyway.

OK, I've removed the script_changes info for the old database. (And with this change the list of script names is no longer hardcoded).

Here's a new version of the patch (unicode-script-2.diff).
msg89675 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-06-24 19:31
>> I'm puzzled why you use a hard-coded list of script names. The set of
>> scripts will certainly change across Unicode versions, and I think it
>> would be better to learn the script names from Scripts.txt.
>
> I hardcoded the list, because I saw no easy way to get the indexes
> consistent across both versions of the database.

Couldn't you have a global cache, something like

    scripts = ['Unknown']
    def findscript(script):
        try:
            return scripts.index(script)
        except ValueError:
            scripts.append(script)
            return len(scripts)-1

>> Out of curiosity: how does the addition of the script property affect
>> the number of distinct database records, and the total size of the database?
>
> I'm not exactly sure how to measure this, but the length of
> _PyUnicode_Database_Records goes from 229 entries to 690 entries.

I think this needs to be fixed, then - we need to study why there are so many new records (e.g. what script contributes most new records), and then look for alternatives.

One alternative could be to create a separate Trie for scripts.

I'd also be curious if we can increase the homogeneity of scripts (i.e. produce longer runs of equal scripts) if we declare that unassigned code points have the script that corresponds to the block (i.e. the script that surrounding characters have), and then only change it to "Unknown" at lookup time if it's unassigned.

> If it's any help I can post the output of makeunicodedata.py.

I'd be interested in "size unicodedata.so", and how it changes. Perhaps the actual size increase isn't that bad.

>> a) two functions are provided: one with the original script names, and
>> one with the lower-case script names
>
> Is this really necessary, if we only have one version of the database?

I don't know what this will be used for, but one application is certainly regular expressions. So we need an efficient test whether the character is in the expected script or not. It would be bad if such a test would have to do a .lower() on each lookup.
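For illustration, the global-cache idea can be combined with reading the script names straight from Scripts.txt. This is only a minimal sketch, assuming the usual `code point or range ; Script # comment` line layout of that file. The sample lines and the names `parse_scripts`, `script_of` and `scripts_lower` are hypothetical and not taken from any patch here; a dict is used instead of `list.index` so that index lookups stay constant-time.

```python
# Hypothetical sample in the Scripts.txt line layout (illustrative only).
SAMPLE = """\
# comment lines and blank lines are skipped

0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
0020          ; Common # Zs       SPACE
0391..03A1    ; Greek # L&  [17] GREEK CAPITAL LETTER ALPHA..GREEK CAPITAL LETTER RHO
"""

scripts = ["Unknown"]      # index 0 is the default for unlisted code points
_index = {"Unknown": 0}    # name -> index, avoids a linear list.index() scan


def findscript(name):
    """Return a stable index for a script name, registering it on first use."""
    idx = _index.get(name)
    if idx is None:
        idx = len(scripts)
        scripts.append(name)
        _index[name] = idx
    return idx


def parse_scripts(text):
    """Map each listed code point to a script index (assumed file layout)."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()   # drop trailing comment
        if not data:
            continue
        cprange, name = (field.strip() for field in data.split(";"))
        first, _, last = cprange.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        idx = findscript(name)
        for cp in range(lo, hi + 1):
            table[cp] = idx
    return table


table = parse_scripts(SAMPLE)
# Lower-cased names are computed once, so no .lower() is needed per lookup.
scripts_lower = [name.lower() for name in scripts]


def script_of(ch, lower=False):
    """Look up the script of a character; unlisted code points are Unknown."""
    idx = table.get(ord(ch), 0)
    return (scripts_lower if lower else scripts)[idx]
```

With this layout, `script_of("A")` and `script_of("A", lower=True)` both reduce to one dict lookup and one list index. A real implementation would of course store ranges compactly rather than one dict entry per code point.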
msg89701 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2009-06-25 09:14
I was comparing apples and oranges: The 229 entries for the trunk were for a UCS2 build (the patched version was UCS4), with UCS4 there are 317 entries for the trunk.

size unicodedata.o gives:

__TEXT	__DATA	__OBJC	others	dec	hex
13622	587057	0	23811	624490	9876a

for trunk and

__TEXT	__DATA	__OBJC	others	dec	hex
17769	588817	0	24454	631040	9a100

for the patched version.
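To put these figures in perspective, the difference can be worked out directly from the numbers quoted in this message. The snippet below is only arithmetic on those values; the variable names are made up for illustration.

```python
# Section sizes in bytes, copied from the "size unicodedata.o" output above.
trunk = {"__TEXT": 13622, "__DATA": 587057, "__OBJC": 0, "others": 23811}
patched = {"__TEXT": 17769, "__DATA": 588817, "__OBJC": 0, "others": 24454}

trunk_total = sum(trunk.values())
patched_total = sum(patched.values())

# Growth per section and overall.
per_section = {name: patched[name] - trunk[name] for name in trunk}
delta = patched_total - trunk_total
percent = round(100 * delta / trunk_total, 2)

print(per_section)
print(f"total growth: {delta} bytes ({percent}%)")
```

The totals match the `dec` column (624490 and 631040), so the patched object file is 6550 bytes larger, roughly a one percent increase.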
msg89973 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2009-07-01 10:54
Here is a new version that includes a new function scriptl() that returns the script name in lowercase.
msg111040 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-07-21 12:19
Could someone with unicode knowledge take this review on, given that comments have already been made and responded to?
msg177469 - (view)	Author: Pander (PanderMusubi)	Date: 2012-12-14 16:52
Please, also consider reviewing functionality offered by: http://pypi.python.org/pypi/unicodescript/ and http://pypi.python.org/pypi/unicodeblocks/ which could be used to improve and extend the proposed patch.
msg177506 - (view)	Author: Pander (PanderMusubi)	Date: 2012-12-14 20:41
The latest version of the respective sources can be found here: https://github.com/ConradIrwin/unicodescript and here: https://github.com/simukis/unicodeblocks
msg214204 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2014-03-20 11:10
Pander: In what way would this extend or improve the current patch?
msg214633 - (view)	Author: Pander (PanderMusubi)	Date: 2014-03-23 20:34
I see the patch supports Unicode scripts https://en.wikipedia.org/wiki/Script_%28Unicode%29 but I am also interested in support for Unicode blocks https://en.wikipedia.org/wiki/Unicode_block Code for support for the latter is at https://github.com/nagisa/unicodeblocks I could not quite make out whether the patch also supports Unicode blocks. If not, should that be requested in a separate issue? Furthermore, support for Unicode scripts and blocks should be updated each time a new version of the Unicode standard is published. Someone should check whether the latest patch needs to be updated to the latest version of Unicode. Not only for this issue but for each release of Python.
msg214636 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2014-03-23 21:30
Adding support for blocks should indeed go into a separate issue. Your code for that is not suitable, as it should integrate with the existing makeunicodedata.py script, which your code does not. And yes, indeed, of course, we update (nearly) all data in Python automatically from the files provided by the Unicode Consortium.
msg226266 - (view)	Author: Elizabeth Myers (Elizacat) *	Date: 2014-09-02 08:22
> I think this needs to be fixed, then - we need to study why there are
> so many new records (e.g. what script contributes most new records),
> and then look for alternatives.

The "Common" script appears to be very fragmented and may be the cause of the issues.

> One alternative could be to create a separate Trie for scripts.

Not having seen the one in C yet, I have one written in Python, custom-made for storing the script database, based on the general idea of a range tree. It stores ranges individually straight out of Scripts.txt. The general idea is you take the average of the lower and upper bounds of a given range (they can be equal). When searching, you compare the codepoint value to the average in the present node, and use that to find which direction to search the tree in.

Without coalescing neighbouring ranges that are the same script, I have 1,606 nodes in the tree (for Unicode 7.0, which added a lot of scripts). After coalescing, there appear to be 806 nodes.

If anyone cares, I'll be more than happy to post code for inspection.

> I don't know what this will be used for, but one application is
> certainly regular expressions. So we need an efficient test whether
> the character is in the expected script or not. It would be bad if
> such a test would have to do a .lower() on each lookup.

This is actually required for restriction-level detection as described in Unicode TR39, for all levels of restriction above ASCII-only (http://www.unicode.org/reports/tr39/#Restriction_Level_Detection).
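The code referred to in this message was not posted, so the following is only a rough sketch of how such a range tree with coalescing might look, not the author's implementation. The names `Node`, `coalesce`, `build` and `lookup` and the handful of sample ranges are assumptions made for illustration; ranges are assumed to be non-overlapping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    lo: int
    hi: int
    script: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def mid(self):
        # Average of the range bounds; lo and hi may be equal.
        return (self.lo + self.hi) // 2


def coalesce(ranges):
    """Merge directly adjacent ranges that carry the same script."""
    merged = []
    for lo, hi, script in sorted(ranges):
        if merged and merged[-1][2] == script and merged[-1][1] + 1 == lo:
            merged[-1] = (merged[-1][0], hi, script)
        else:
            merged.append((lo, hi, script))
    return merged


def build(ranges):
    """Build a balanced tree from sorted, non-overlapping ranges."""
    if not ranges:
        return None
    m = len(ranges) // 2
    lo, hi, script = ranges[m]
    return Node(lo, hi, script, build(ranges[:m]), build(ranges[m + 1:]))


def lookup(node, cp):
    """Descend by comparing the code point with each node's midpoint."""
    while node is not None:
        if node.lo <= cp <= node.hi:
            return node.script
        node = node.left if cp < node.mid else node.right
    return "Unknown"


# Illustrative sample covering part of the ASCII range only.
SAMPLE = [
    (0x30, 0x39, "Common"), (0x3A, 0x40, "Common"),
    (0x41, 0x5A, "Latin"), (0x5B, 0x60, "Common"),
    (0x61, 0x7A, "Latin"), (0x7B, 0x7E, "Common"),
]

merged = coalesce(SAMPLE)
tree = build(merged)
```

In this sample the two neighbouring Common ranges collapse into one, so six ranges become five nodes. Because the ranges do not overlap, a code point outside the current node's range is below its midpoint exactly when it is below the whole range, which is why comparing against the midpoint is enough to choose a direction.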
msg251214 - (view)	Author: Cosimo Lupo (Cosimo Lupo)	Date: 2015-09-21 09:03
I would very much like a `script()` function to be added to the built-in unicodedata module. What's the current status of this issue? Thanks. Cosimo
msg285269 - (view)	Author: Pander (PanderMusubi)	Date: 2017-01-11 20:31
Any updates or ideas on how to move this forward? See also https://bugs.python.org/issue16684 Thanks.
msg320090 - (view)	Author: Pander (PanderMusubi)	Date: 2018-06-20 16:09
Since June 2018, Unicode version 11.0 is out. Perhaps that could help move this forward.
msg320092 - (view)	Author: STINNER Victor (vstinner) *	Date: 2018-06-20 16:20
> Since June 2018, Unicode version 11.0 is out. Perhaps that could help move this forward.

Python 3.7 has been upgraded to Unicode 11.

History
Date	User	Action	Args
2022-04-11 14:56:50	admin	set	github: 50580
2019-08-28 05:09:27	Greg Price	set	nosy: + Greg Price
2018-06-20 16:20:05	vstinner	set	messages: + msg320092
2018-06-20 16:09:15	PanderMusubi	set	messages: + msg320090
2017-01-11 21:28:17	pitrou	set	nosy: - pitrou
2017-01-11 20:37:51	serhiy.storchaka	set	versions: + Python 3.7, - Python 3.6
2017-01-11 20:31:36	PanderMusubi	set	messages: + msg285269
2015-10-17 07:46:49	Denis Jacquerye	set	nosy: + Denis Jacquerye
2015-09-21 09:48:22	berker.peksag	set	nosy: + berker.peksag
2015-09-21 09:44:58	BreamoreBoy	set	versions: + Python 3.6, - Python 3.4
2015-09-21 09:03:10	Cosimo Lupo	set	nosy: + Cosimo Lupo messages: + msg251214
2014-09-02 08:22:51	Elizacat	set	nosy: + Elizacat messages: + msg226266
2014-03-23 21:30:53	loewis	set	messages: + msg214636
2014-03-23 20:34:08	PanderMusubi	set	messages: + msg214633
2014-03-20 11:10:32	loewis	set	messages: + msg214204
2014-02-03 18:40:43	BreamoreBoy	set	nosy: - BreamoreBoy
2013-02-10 18:26:00	pitrou	set	nosy: + lemburg, pitrou, vstinner, benjamin.peterson
2012-12-14 20:41:36	PanderMusubi	set	messages: + msg177506
2012-12-14 16:52:17	PanderMusubi	set	nosy: + PanderMusubi messages: + msg177469
2012-09-26 17:08:17	ezio.melotti	set	versions: + Python 3.4, - Python 3.2
2010-07-21 12:19:04	BreamoreBoy	set	nosy: + BreamoreBoy messages: + msg111040 versions: + Python 3.2, - Python 2.7
2009-07-24 09:44:35	ezio.melotti	set	keywords: + needs review stage: patch review
2009-07-01 10:55:20	doerwalter	set	files: + unicode-script-3.diff messages: + msg89973
2009-06-25 09:14:17	doerwalter	set	messages: + msg89701
2009-06-24 19:31:08	loewis	set	messages: + msg89675
2009-06-24 18:56:52	doerwalter	set	files: + unicode-script-2.diff
2009-06-24 18:56:05	doerwalter	set	messages: + msg89671
2009-06-24 06:36:17	ezio.melotti	set	priority: normal nosy: + ezio.melotti
2009-06-23 22:02:04	akitada	set	nosy: + akitada
2009-06-23 21:36:00	loewis	set	nosy: + loewis messages: + msg89647
2009-06-23 20:50:57	doerwalter	create