Message 122867 - Python tracker


Author	lemburg
Recipients	belopolsky, ezio.melotti, lemburg, loewis
Date	2010-11-29.20:42:30
Message-id	<4CF41035.1010205@egenix.com>
In-reply-to	<1291061459.3.0.34838576769.issue10575@psf.upfronthosting.co.za>

Content
Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
> This is not a bug, see
> 
> http://www.unicode.org/reports/tr44/#Numeric_Value
> 
> Characters have a Numeric_Type property of either null, Decimal, Digit, or Numeric. For non-Unihan characters, this is denoted by filling out either no column, or (6, 7, and 8), or (7 and 8), or (8), respectively, as implemented by makeunicodedata.py. Unihan characters have only null or Numeric as their Numeric_Type property, never Decimal or Digit, see
> 
> http://www.unicode.org/reports/tr44/#Numeric_Type_Han
> 
> Therefore, it is correct that digit() raises a ValueError for U+4e09.
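
For illustration only, the column rule quoted above can be sketched as a small classifier over a UnicodeData.txt record. This is not the code in makeunicodedata.py; the function name numeric_type is hypothetical, the fields are assumed to be 0-indexed and semicolon-separated, and the sample records are written out by hand in the UnicodeData.txt layout.

```python
def numeric_type(record: str) -> str:
    """Derive Numeric_Type from one UnicodeData.txt-style record.

    Hypothetical helper: fields 6, 7 and 8 (0-indexed) are taken to be
    the decimal, digit and numeric value columns, per the quoted rule.
    """
    fields = record.split(";")
    decimal, digit, numeric = fields[6], fields[7], fields[8]
    if decimal:
        # Columns 6, 7 and 8 are all filled in.
        return "Decimal"
    if digit:
        # Only columns 7 and 8 are filled in.
        return "Digit"
    if numeric:
        # Only column 8 is filled in.
        return "Numeric"
    # No numeric column is filled in.
    return "None"


# Hand-written sample records in the UnicodeData.txt layout.
SAMPLES = {
    "DIGIT ONE": "0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;",
    "SUPERSCRIPT TWO": "00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;;2;2;N;SUPERSCRIPT DIGIT TWO;;;;",
    "VULGAR FRACTION ONE HALF": "00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;",
    "LATIN CAPITAL LETTER A": "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
}

if __name__ == "__main__":
    for name, record in SAMPLES.items():
        print(name, "->", numeric_type(record))
```

Unihan characters do not appear in this form in UnicodeData.txt; their numeric values come from the Unihan database, which is why they can only end up as null or Numeric.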

You're right. I guess this is a bug in the UCD or TR44/TR38 itself.

It looks like the numeric properties are not separated in the
Unihan database in the same way they are for the standard UCD.

Unihan separates based on usage context, whereas UCS takes
a parsing approach.
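
The behaviour discussed in this message can be checked with the standard unicodedata module. The snippet below is a minimal sketch; the helper name numeric_views is hypothetical, and the exact results depend on the Unicode database version bundled with the running interpreter.

```python
import unicodedata


def numeric_views(ch: str) -> dict:
    """Collect decimal(), digit() and numeric() results for one character.

    A lookup that raises ValueError is recorded as None, so the three
    numeric properties can be compared side by side.
    """
    results = {}
    for name in ("decimal", "digit", "numeric"):
        lookup = getattr(unicodedata, name)
        try:
            results[name] = lookup(ch)
        except ValueError:
            # The character has no value for this property.
            results[name] = None
    return results


if __name__ == "__main__":
    # U+4E09 is the Han character for "three"; "3" is the ASCII digit.
    for ch in ("\u4e09", "3"):
        print(f"U+{ord(ch):04X}", numeric_views(ch))
```

For U+4E09 only the numeric lookup is expected to succeed, which matches a Numeric_Type of Numeric rather than Decimal or Digit.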

History
Date	User	Action	Args
2010-11-29 20:42:32	lemburg	set	recipients: + lemburg, loewis, belopolsky, ezio.melotti
2010-11-29 20:42:30	lemburg	link	issue10575 messages
2010-11-29 20:42:30	lemburg	create