classification
Title: makeunicodedata.py does not support Unihan digit data
Type:
Stage:
Components: Unicode
Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: belopolsky, ezio.melotti, lemburg, loewis
Priority: normal
Keywords:

Created on 2010-11-29 11:10 by lemburg, last changed 2010-11-29 20:46 by loewis. This issue is now closed.

Messages (13)
msg122786 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-11-29 11:10
The script only patches numeric data into the table (field 8), but does not update the digit field (field 7).

As a result, ideographs used for Chinese digits are not recognized as digits and not evaluated by int(), long() and float():

    http://en.wikipedia.org/wiki/Numbers_in_Chinese_culture

>>> unicode('三', 'utf-8')
u'\u4e09'

>>> int(unicode('三', 'utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character u'\u4e09' in position 0: invalid decimal Unicode string
> <stdin>(1)<module>()

>>> import unicodedata
>>> unicodedata.digit(unicode('三', 'utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: not a digit

The code point refers to the digit 3.
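
The session above is Python 2. As a rough Python 3 equivalent (a sketch only; the helper name try_call is made up for this illustration), the same two lookups can be checked without a traceback:

```python
import unicodedata

san = "\u4e09"  # the ideograph 三 from the session above


def try_call(func, arg):
    """Hypothetical helper: return ("ok", result) or ("ValueError", message)."""
    try:
        return ("ok", func(arg))
    except ValueError as exc:
        return ("ValueError", str(exc))


int_result = try_call(int, san)                  # int() parsing of the ideograph
digit_result = try_call(unicodedata.digit, san)  # digit-field lookup

print(int_result)
print(digit_result)
```

Both calls are expected to end in the ValueError branch, which matches the behaviour reported here.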
msg122809 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-11-29 15:15
The code point is also not listed as a decimal digit (relevant for the int() decimal parsing):

>>> unicodedata.decimal(unicode('三', 'utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: not a decimal

This is the relevant part of the script:

        for line in open(unihan):
            if not line.startswith('U+'):
                continue
            code, tag, value = line.split(None, 3)[:3]
            if tag not in ('kAccountingNumeric', 'kPrimaryNumeric',
                           'kOtherNumeric'):
                continue
            value = value.strip().replace(',', '')
            i = int(code[2:], 16)
            # Patch the numeric field
            if table[i] is not None:
                table[i][8] = value

The decimal column is not set for code points that have a kPrimaryNumeric value set. Position table[i][8] refers to the
numeric database entry, which correctly gives:

>>> unicodedata.numeric(unicode('三', 'utf-8'))
3.0
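
To make the effect of that loop concrete, here is a minimal, self-contained sketch of the same patching step run against a toy table. The function name patch_numeric, the dict-based table, the sample Unihan lines, and the 15-field record are all assumptions made for this illustration; the real script reads the Unihan file and uses its own table structure.

```python
# Tags that carry numeric values, as in the script excerpt.
NUMERIC_TAGS = ("kAccountingNumeric", "kPrimaryNumeric", "kOtherNumeric")


def patch_numeric(table, lines):
    """Copy Unihan numeric values into field 8 of matching records."""
    for line in lines:
        if not line.startswith("U+"):
            continue
        code, tag, value = line.split(None, 3)[:3]
        if tag not in NUMERIC_TAGS:
            continue
        record = table.get(int(code[2:], 16))
        if record is not None:
            # Only the numeric field is touched; fields 6 and 7 stay as they were.
            record[8] = value.strip().replace(",", "")


# Hypothetical sample input in the tab-separated Unihan layout.
sample_lines = [
    "# comment line, skipped\n",
    "U+4E09\tkDefinition\tthree\n",
    "U+4E09\tkPrimaryNumeric\t3\n",
]

# Hypothetical 15-field record with empty decimal, digit, and numeric fields.
table = {0x4E09: ["4E09", "CJK UNIFIED IDEOGRAPH-4E09", "Lo", "0", "L"] + [""] * 10}

patch_numeric(table, sample_lines)
decimal_field, digit_field, numeric_field = table[0x4E09][6:9]
print(repr(decimal_field), repr(digit_field), repr(numeric_field))
```

After the call, only field 8 of the toy record holds a value, which is the behaviour this report is about.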
msg122811 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-11-29 15:16
Here's a quick overview of the fields that are set for U+4E09:

http://www.fileformat.info/info/unicode/char/4e09/index.htm
msg122812 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-11-29 15:17
This is the definition of kPrimaryNumeric:

http://ftp.lanet.lv/ftp/mirror/unicode/5.0.0/ucd/Unihan.html#kPrimaryNumeric
msg122827 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-29 16:45
I am adding #10552 as a dependency because I think we should fix unicode data generation in 3.x before adding new features to the scripts.

I am also not sure whether this is a bug or a feature request. Martin?
msg122839 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-11-29 18:29
Alexander Belopolsky wrote:
> 
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
> 
> I am adding #10552 as a dependency because I think we should fix unicode data generation in 3.x before adding new features to the scripts.
> 
> I am also not sure whether this is a bug or a feature request. Martin?

I consider this a bug (which is why I added Python 2.7 to the list
of versions), since those code points need to be mapped to decimal
and digit as well (see the references I posted; and compare ).

Both Chinese and Japanese use the 4E00 ff. code points as decimal
code points.
msg122851 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-29 19:04
On Mon, Nov 29, 2010 at 1:29 PM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
..
>
> I consider this a bug (which is why I added Python 2.7 to the list
> of versions), since those code points need to be mapped to decimal
> and digit as well (see the references I posted; and compare ).
>

I don't disagree.  However using Unicode 5.2.0 instead of the latest
6.0.0 may be considered a bug as well.  The practical issue is whether
to maintain two separate versions of Tools/unicode for 3.x and 2.7 or
merge 3.x changes back to 2.7 and support 3.x using 2to3.  Another
option is to simply use only 2.7 (or only 3.x) with Tools/unicode and
control the differences between 2.7 and 3.x using a command
line switch.
msg122859 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-11-29 19:52
> I am adding #10552 as a dependency because I think we should fix
> unicode data generation in 3.x before adding new features to the
> scripts.
> 
> I am also not sure whether this is a bug or a feature request.
> Martin?

I fail to see the relevance of gencodec to this issue (and, as
you see in my comment to #10552, I very much fail to see the relevance
of that issue, or of gencodec in the first place).
msg122862 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-11-29 20:10
This is not a bug, see

http://www.unicode.org/reports/tr44/#Numeric_Value

Characters have a Numeric_Type property of either null, Decimal, Digit, or Numeric. For non-Unihan characters, this is denoted by filling out either no column, or (6, 7, and 8), or (7 and 8), or (8), respectively, as implemented by makeunicodedata.py. Unihan characters have only null or Numeric as their Numeric_Type property, never Decimal or Digit, see

 http://www.unicode.org/reports/tr44/#Numeric_Type_Han

Therefore, it is correct that digit() raises a ValueError for U+4e09.
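
As an illustration of that column rule, the following sketch infers the Numeric_Type of a character from which unicodedata lookups succeed. The function name numeric_type is made up for this example, and the inference relies on the assumption that the three lookups mirror columns 6, 7, and 8 as described above.

```python
import unicodedata


def numeric_type(ch):
    """Hypothetical helper: infer Numeric_Type from the filled-in fields."""
    if unicodedata.decimal(ch, None) is not None:
        return "Decimal"   # columns 6, 7, and 8 are set
    if unicodedata.digit(ch, None) is not None:
        return "Digit"     # columns 7 and 8 are set
    if unicodedata.numeric(ch, None) is not None:
        return "Numeric"   # only column 8 is set
    return None            # no numeric column is set


# DIGIT THREE, SUPERSCRIPT THREE, VULGAR FRACTION ONE QUARTER, U+4E09, LATIN CAPITAL LETTER A
for ch in ("3", "\u00b3", "\u00bc", "\u4e09", "A"):
    print("U+%04X" % ord(ch), numeric_type(ch))
```

U+4E09 lands in the Numeric branch, like the fraction, rather than in Decimal or Digit.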
msg122863 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-11-29 20:12
Alexander Belopolsky wrote:
> 
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
> 
> On Mon, Nov 29, 2010 at 1:29 PM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
> ..
>>
>> I consider this a bug (which is why I added Python 2.7 to the list
>> of versions), since those code points need to be mapped to decimal
>> and digit as well (see the references I posted; and compare ).
>>
> 
> I don't disagree.  However using Unicode 5.2.0 instead of the latest
> 6.0.0 may be considered a bug as well. 

No, since we only ever change the UCD version once per Python
release.

Note that those standards don't have a version number just for the
fun of it. Each version is a standard of its own and only
patch level updates will go into it.

It's not a bug to stick to an older UCD version.

> The practical issue is whether
> to maintain two separate versions of Tools/unicode for 3.x and 2.7 or
> merge 3.x changes back to 2.7 and support 3.x using 2to3.  Another
> option is to simply use only 2.7 (or only 3.x) with Tools/unicode and
> control the differences between 2.7 and 3.x using a command
> line switch.

I'm not sure whether the effort is worth it. We don't run those
tools often enough to invest much time into keeping them in sync
between 2.x and 3.x.
msg122866 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-29 20:22
> I fail to see the relevance of gencodec to this issue ...

Thanks for the explanation.  I wrongly assumed that "make all" is the way to regenerate both unicodedata and the encodings and that the two are interdependent.
msg122867 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-11-29 20:42
Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
> This is not a bug, see
> 
> http://www.unicode.org/reports/tr44/#Numeric_Value
> 
> Characters have a Numeric_Type property of either null, Decimal, Digit, or Numeric. For non-Unihan characters, this is denoted by filling out either no column, or (6,7,and 8), or (7 and 8), or (8), respectively, as implemented by makeunicodedata.py. Unihan characters have only null or Numeric as their Numeric_Type property, never Decimal nor Digit, see
> 
>  http://www.unicode.org/reports/tr44/#Numeric_Type_Han
> 
> Therefore, it is correct that digit() raises a ValueError for U+4e09.

You're right. I guess this is a bug in the UCD or TR44/TR38 itself.

It looks like the numeric properties are not separated in the
Unihan database in the same way they are for the standard UCD.

Unihan separates based on usage context, whereas UCS takes
a parsing approach.
msg122868 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-11-29 20:42
> Thanks for the explanation.  I wrongly assumed that "make all" is the
> way to regenerate both unicodedata and the encodings and that the two
> are interdependent.

Ah. I never use the Makefile.
History
Date                 User          Action  Args
2010-11-29 20:46:24  loewis        set     status: open -> closed; resolution: not a bug
2010-11-29 20:42:58  loewis        set     messages: + msg122868
2010-11-29 20:42:30  lemburg       set     messages: + msg122867
2010-11-29 20:22:31  belopolsky    set     dependencies: - Tools/unicode/gencodec.py error; messages: + msg122866
2010-11-29 20:12:50  lemburg       set     messages: + msg122863
2010-11-29 20:10:55  loewis        set     messages: + msg122862
2010-11-29 19:52:15  loewis        set     messages: + msg122859
2010-11-29 19:04:54  belopolsky    set     messages: + msg122851
2010-11-29 18:29:00  lemburg       set     messages: + msg122839
2010-11-29 16:49:02  ezio.melotti  set     nosy: + ezio.melotti
2010-11-29 16:45:33  belopolsky    set     nosy: + loewis, belopolsky; dependencies: + Tools/unicode/gencodec.py error; messages: + msg122827
2010-11-29 15:17:22  lemburg       set     messages: + msg122812
2010-11-29 15:16:14  lemburg       set     messages: + msg122811
2010-11-29 15:15:36  lemburg       set     messages: + msg122809
2010-11-29 11:10:54  lemburg       create