Issue 10575: makeunicodedata.py does not support Unihan digit data


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/54784

classification

Title:	makeunicodedata.py does not support Unihan digit data
Type:		Stage:
Components:	Unicode	Versions:	Python 3.2, Python 3.3, Python 2.7

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	belopolsky, ezio.melotti, lemburg, loewis
Priority:	normal	Keywords:

Created on 2010-11-29 11:10 by lemburg, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (13)
msg122786	Author: Marc-Andre Lemburg (lemburg)	Date: 2010-11-29 11:10
The script only patches numeric data into the table (field 8), but does not update the digit field (field 7). As a result, ideographs used for Chinese digits are not recognized as digits and not evaluated by int(), long() and float():

http://en.wikipedia.org/wiki/Numbers_in_Chinese_culture

    >>> unicode('三', 'utf-8')
    u'\u4e09'
    >>> int(unicode('三', 'utf-8'))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'decimal' codec can't encode character u'\u4e09' in position 0: invalid decimal Unicode string
    > <stdin>(1)<module>()
    >>> import unicodedata
    >>> unicodedata.digit(unicode('三', 'utf-8'))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: not a digit

The code point refers to the digit 3.

msg122809	Author: Marc-Andre Lemburg (lemburg)	Date: 2010-11-29 15:15
The code point is also not listed as decimal digit (relevant for the int() decimal parsing):

    >>> unicodedata.decimal(unicode('三', 'utf-8'))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: not a decimal

This is the relevant part of the script:

    for line in open(unihan):
        if not line.startswith('U+'):
            continue
        code, tag, value = line.split(None, 3)[:3]
        if tag not in ('kAccountingNumeric', 'kPrimaryNumeric',
                       'kOtherNumeric'):
            continue
        value = value.strip().replace(',', '')
        i = int(code[2:], 16)
        # Patch the numeric field
        if table[i] is not None:
            table[i][8] = value

The decimal column is not set for code points that have a kPrimaryNumeric value set.

Position table[i][8] refers to the numeric database entry, which correctly gives:

    >>> unicodedata.numeric(unicode('三', 'utf-8'))
    3.0

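The patching step described in this message can be tried in isolation. The following is a minimal Python 3 sketch, not the code from makeunicodedata.py: the function name `patch_numeric`, the dict-based `table`, the 15-field record length, and the sample lines are all assumptions made for illustration. It shows that only field 8 is written, while fields 6 and 7 (decimal and digit) stay empty.

```python
# Sketch only: `patch_numeric` and the sample data are hypothetical,
# not taken from Tools/unicode/makeunicodedata.py.
NUMERIC_TAGS = ("kAccountingNumeric", "kPrimaryNumeric", "kOtherNumeric")
NUMERIC_FIELD = 8  # field index named in the message above


def patch_numeric(lines, table):
    """Copy Unihan numeric values into field 8 of matching records."""
    for line in lines:
        if not line.startswith("U+"):
            continue  # skip comments and blank lines
        code, tag, value = line.split(None, 2)
        if tag not in NUMERIC_TAGS:
            continue
        i = int(code[2:], 16)
        record = table.get(i)
        if record is not None:
            record[NUMERIC_FIELD] = value.strip().replace(",", "")
    return table


# Illustrative input in the "U+XXXX<TAB>tag<TAB>value" shape the loop expects.
sample = [
    "# comment line",
    "U+4E09\tkPrimaryNumeric\t3",
    "U+4E07\tkPrimaryNumeric\t10000",
]
# Hypothetical table: code point -> list of string fields.
table = {0x4E09: [""] * 15, 0x4E07: [""] * 15}
patch_numeric(sample, table)
```

After the call, `table[0x4E09][8]` holds the numeric value while `table[0x4E09][6]` and `table[0x4E09][7]` are still empty, which is the behavior reported here.
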
msg122811	Author: Marc-Andre Lemburg (lemburg)	Date: 2010-11-29 15:16
Here's a quick overview of the fields that are set for U+4E09: http://www.fileformat.info/info/unicode/char/4e09/index.htm
msg122812	Author: Marc-Andre Lemburg (lemburg)	Date: 2010-11-29 15:17
This is the definition of kPrimaryNumeric: http://ftp.lanet.lv/ftp/mirror/unicode/5.0.0/ucd/Unihan.html#kPrimaryNumeric
msg122827	Author: Alexander Belopolsky (belopolsky)	Date: 2010-11-29 16:45
I am adding #10552 as a dependency because I think we should fix unicode data generation in 3.x before adding new features to the scripts.

I am also not sure whether this is a bug or a feature request. Martin?

msg122839	Author: Marc-Andre Lemburg (lemburg)	Date: 2010-11-29 18:29
Alexander Belopolsky wrote:
>
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
>
> I am adding #10552 as a dependency because I think we should fix unicode data generation in 3.x before adding new features to the scripts.
>
> I am also not sure whether this is a bug or a feature request. Martin?

I consider this a bug (which is why I added Python 2.7 to the list of versions), since those code points need to be mapped to decimal and digit as well (see the references I posted; and compare ).

Both Chinese and Japanese use the 4E00 ff. code points as decimal code points.

msg122851	Author: Alexander Belopolsky (belopolsky)	Date: 2010-11-29 19:04
On Mon, Nov 29, 2010 at 1:29 PM, Marc-Andre Lemburg <report@bugs.python.org> wrote:
..
>
> I consider this a bug (which is why I added Python 2.7 to the list
> of versions), since those code points need to be mapped to decimal
> and digit as well (see the references I posted; and compare ).
>

I don't disagree. However, using Unicode 5.2.0 instead of the latest 6.0.0 may be considered a bug as well. The practical issue is whether to maintain two separate versions of Tools/unicode for 3.x and 2.7 or merge 3.x changes back to 2.7 and support 3.x using 2to3. Another option is to simply use only 2.7 (or only 3.x) with Tools/unicode and control the differences between 2.7 and 3.x using a command line switch.

msg122859	Author: Martin v. Löwis (loewis)	Date: 2010-11-29 19:52
> I am adding #10552 as a dependency because I think we should fix
> unicode data generation in 3.x before adding new features to the
> scripts.
>
> I am also not sure whether this is a bug or a feature request.
> Martin?

I fail to see the relevance of gencodec to this issue (and, as you see in my comment to #10552, I very much fail to see the relevance of that issue, or of gencodec in the first place).

msg122862	Author: Martin v. Löwis (loewis)	Date: 2010-11-29 20:10
This is not a bug, see

http://www.unicode.org/reports/tr44/#Numeric_Value

Characters have a Numeric_Type property of either null, Decimal, Digit, or Numeric. For non-Unihan characters, this is denoted by filling out either no column, or (6, 7, and 8), or (7 and 8), or (8), respectively, as implemented by makeunicodedata.py. Unihan characters have only null or Numeric as their Numeric_Type property, never Decimal nor Digit, see

http://www.unicode.org/reports/tr44/#Numeric_Type_Han

Therefore, it is correct that digit() raises a ValueError for U+4e09.

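The column scheme in this message can be checked from Python through the unicodedata module. The following is a minimal Python 3 sketch (the session earlier in this thread is Python 2); the helper name `numeric_type` is hypothetical, and it infers the Numeric_Type from which of decimal(), digit() and numeric() return a value rather than reading the property directly.

```python
import unicodedata


def numeric_type(ch):
    """Infer Numeric_Type from which unicodedata lookups succeed.

    Hypothetical helper: each lookup is given None as its default so
    that a missing column yields None instead of raising ValueError.
    """
    if unicodedata.decimal(ch, None) is not None:
        return "Decimal"   # columns 6, 7 and 8 filled
    if unicodedata.digit(ch, None) is not None:
        return "Digit"     # columns 7 and 8 filled
    if unicodedata.numeric(ch, None) is not None:
        return "Numeric"   # only column 8 filled
    return None            # no numeric columns


for ch in ("5", "\u00b2", "\u00bd", "\u4e09", "a"):
    print("U+%04X" % ord(ch), numeric_type(ch))
```

Under this sketch U+4E09 is reported as Numeric, in the same class as U+00BD (one half), which matches the explanation that Unihan characters are never Decimal or Digit.
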
msg122863	Author: Marc-Andre Lemburg (lemburg)	Date: 2010-11-29 20:12
Alexander Belopolsky wrote:
>
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
>
> On Mon, Nov 29, 2010 at 1:29 PM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
> ..
>>
>> I consider this a bug (which is why I added Python 2.7 to the list
>> of versions), since those code points need to be mapped to decimal
>> and digit as well (see the references I posted; and compare ).
>>
>
> I don't disagree. However using Unicode 5.2.0 instead of the latest
> 6.0.0 may be considered a bug as well.

No, since we only ever change the UCD version once per Python release. Note that those standards don't have a version number just for the fun of it. Each version is a standard of its own and only patch level updates will go into it. It's not a bug to stick to an older UCD version.

> The practical issue is whether
> to maintain two separate versions of Tools/unicode for 3.x and 2.7 or
> merge 3.x changes back to 2.7 and support 3.x using 2to3. Another
> option is to simply use only 2.7 (or only 3.x) with Tools/unicode and
> control the differences between 2.7 and 3.x using a command
> line switch.

I'm not sure whether the effort is worth it. We don't run those tools often enough to invest much time into keeping them in sync between 2.x and 3.x.

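Which UCD version a given interpreter was built against can be read at runtime. A minimal Python 3 sketch, where the variable names are illustrative and the printed value depends on the Python release in use:

```python
import unicodedata

# Version string of the Unicode Character Database bundled with this
# interpreter; the exact value varies between Python releases.
ucd_version = unicodedata.unidata_version
major = int(ucd_version.split(".")[0])

print("UCD version:", ucd_version)
```
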
msg122866	Author: Alexander Belopolsky (belopolsky)	Date: 2010-11-29 20:22
> I fail to see the relevance of gencodec to this issue ...

Thanks for the explanation. I wrongly assumed that "make all" is the way to regenerate both unicodedata and the encodings and that the two are interdependent.

msg122867	Author: Marc-Andre Lemburg (lemburg)	Date: 2010-11-29 20:42
Martin v. Löwis wrote:
>
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> This is not a bug, see
>
> http://www.unicode.org/reports/tr44/#Numeric_Value
>
> Characters have a Numeric_Type property of either null, Decimal, Digit, or Numeric. For non-Unihan characters, this is denoted by filling out either no column, or (6,7,and 8), or (7 and 8), or (8), respectively, as implemented by makeunicodedata.py. Unihan characters have only null or Numeric as their Numeric_Type property, never Decimal nor Digit, see
>
> http://www.unicode.org/reports/tr44/#Numeric_Type_Han
>
> Therefore, it is correct that digit() raises a ValueError for U+4e09.

You're right. I guess this is a bug in the UCD or TR44/TR38 itself.

It looks like the numeric properties are not separated in the Unihan database in the same way they are for the standard UCD. Unihan separates based on usage context, whereas UCS takes a parsing approach.

msg122868	Author: Martin v. Löwis (loewis)	Date: 2010-11-29 20:42
> Thanks for the explanation. I wrongly assumed that "make all" is the
> way to regenerate both unicodedata and the encodings and that the two
> are interdependent.

Ah. I never use the Makefile.

History
Date	User	Action	Args
2022-04-11 14:57:09	admin	set	github: 54784
2010-11-29 20:46:24	loewis	set	status: open -> closed resolution: not a bug
2010-11-29 20:42:58	loewis	set	messages: + msg122868
2010-11-29 20:42:30	lemburg	set	messages: + msg122867
2010-11-29 20:22:31	belopolsky	set	dependencies: - Tools/unicode/gencodec.py error messages: + msg122866
2010-11-29 20:12:50	lemburg	set	messages: + msg122863
2010-11-29 20:10:55	loewis	set	messages: + msg122862
2010-11-29 19:52:15	loewis	set	messages: + msg122859
2010-11-29 19:04:54	belopolsky	set	messages: + msg122851
2010-11-29 18:29:00	lemburg	set	messages: + msg122839
2010-11-29 16:49:02	ezio.melotti	set	nosy: + ezio.melotti
2010-11-29 16:45:33	belopolsky	set	nosy: + loewis, belopolsky dependencies: + Tools/unicode/gencodec.py error messages: + msg122827
2010-11-29 15:17:22	lemburg	set	messages: + msg122812
2010-11-29 15:16:14	lemburg	set	messages: + msg122811
2010-11-29 15:15:36	lemburg	set	messages: + msg122809
2010-11-29 11:10:54	lemburg	create