Author mdk
Recipients Anna Koroliuk, docs@python, ethan.furman, ezio.melotti, martin.panter, mdk, serhiy.storchaka, terry.reedy, vstinner
Date 2016-03-12.21:31:15
Message-id <1457818275.65.0.809023061424.issue26483@psf.upfronthosting.co.za>
In-reply-to
Content
To dig further: the DIGIT_MASK and DECIMAL_MASK used in `unicodeobject.c` come from `unicodectype.c`, and they match the values in `unicodetype_db.h`, which is generated by `Tools/unicode/makeunicodedata.py`. That script builds those masks this way:

    # decimal digit, integer digit
    decimal = 0
    if record[6]:
        flags |= DECIMAL_MASK
        decimal = int(record[6])
    digit = 0
    if record[7]:
        flags |= DIGIT_MASK
        digit = int(record[7])
    if record[8]:
        flags |= NUMERIC_MASK
        numeric.setdefault(record[8], []).append(char)
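
The excerpt above depends on variables defined elsewhere in the script, so here is a minimal, self-contained sketch of the same flag logic. It is not the actual generator: the sample lines are hand-written in UnicodeData.txt format, `number_flags` is a hypothetical helper, and the mask values are illustrative (the real constants are defined in `makeunicodedata.py`).

```python
# Illustrative bit values (assumption); the real constants live in
# Tools/unicode/makeunicodedata.py.
DECIMAL_MASK = 0x02
DIGIT_MASK = 0x04
NUMERIC_MASK = 0x800

# Hand-written sample lines in UnicodeData.txt format (semicolon-separated).
SAMPLE_LINES = [
    "0035;DIGIT FIVE;Nd;0;EN;;5;5;5;N;;;;;",
    "00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;;2;2;N;SUPERSCRIPT DIGIT TWO;;;;",
    "2155;VULGAR FRACTION ONE FIFTH;No;0;ON;<fraction> 0031 2044 0035;;;1/5;N;FRACTION ONE FIFTH;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
]


def number_flags(line):
    """Return (code point, flags) for one UnicodeData.txt-style line."""
    record = line.split(";")
    flags = 0
    if record[6]:  # field 6: decimal digit value
        flags |= DECIMAL_MASK
    if record[7]:  # field 7: digit value
        flags |= DIGIT_MASK
    if record[8]:  # field 8: numeric value
        flags |= NUMERIC_MASK
    return int(record[0], 16), flags


if __name__ == "__main__":
    for line in SAMPLE_LINES:
        code_point, flags = number_flags(line)
        print(
            f"U+{code_point:04X}",
            "decimal" if flags & DECIMAL_MASK else "-",
            "digit" if flags & DIGIT_MASK else "-",
            "numeric" if flags & NUMERIC_MASK else "-",
        )
```

Each flag depends only on whether the corresponding field is non-empty, which is why a character such as SUPERSCRIPT TWO gets DIGIT_MASK without DECIMAL_MASK.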

Those "record"s are documented in ftp://unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html in which fields 6, 7, and 8 are:

 - 6	Decimal digit value	N	This is a numeric field. If the character has the decimal digit property, as specified in Chapter 4 of the Unicode Standard, the value of that digit is represented with an integer value in this field

 - 7	Digit value	N	This is a numeric field. If the character represents a digit, not necessarily a decimal digit, the value is here. This covers digits which do not form decimal radix forms, such as the compatibility superscript digits

 - 8	Numeric value	N	This is a numeric field. If the character has the numeric property, as specified in Chapter 4 of the Unicode Standard, the value of that character is represented with an integer or rational number in this field. This includes fractions as, e.g., "1/5" for U+2155 VULGAR FRACTION ONE FIFTH. Also included are numerical values for compatibility characters such as circled numbers.

These field descriptions are very close to the current documentation. Yet the documentation is misleading when it says "This category includes digit characters" in the "isdecimal" entry.
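
To see how fields 6, 7, and 8 differ for concrete characters, the standard `unicodedata` module exposes them through `decimal()`, `digit()`, and `numeric()`. The following is only a quick sketch; `number_fields` is a hypothetical helper, and the sample characters are my own picks rather than anything from the documentation.

```python
import unicodedata


def number_fields(ch):
    """Return (decimal, digit, numeric) for ch, with None for a missing field."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


if __name__ == "__main__":
    # DIGIT FIVE, ARABIC-INDIC DIGIT ZERO, SUPERSCRIPT TWO,
    # VULGAR FRACTION ONE FIFTH
    for ch in ["5", "\u0660", "\u00b2", "\u2155"]:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {number_fields(ch)}")
```

SUPERSCRIPT TWO has a digit value but no decimal value, and VULGAR FRACTION ONE FIFTH has only a numeric value, which matches the three field descriptions quoted above.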

Possible rewriting:

isdecimal: Return true if all characters in the string are decimal characters and there is at least one character, false otherwise. Decimal characters are those that can be used to form decimal-radix numbers, e.g. U+0660, ARABIC-INDIC DIGIT ZERO. Formally, a decimal character is a character in the Unicode General Category "Nd".

isdigit: Return true if all characters in the string are digits and there is at least one character, false otherwise. Digits include decimal characters and digits that need special handling, such as the compatibility superscript digits. This covers digits which do not form decimal radix forms. Formally, a digit is a character that has the property value Numeric_Type=Digit or Numeric_Type=Decimal.
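
As a sanity check on the proposed wording, a short sketch can compare the three string methods on a few characters. `classify` is a hypothetical helper and the sample characters are assumptions chosen for illustration.

```python
import unicodedata


def classify(s):
    """Return (isdecimal, isdigit, isnumeric) for the string s."""
    return (s.isdecimal(), s.isdigit(), s.isnumeric())


if __name__ == "__main__":
    samples = ["7", "\u0660", "\u00b2", "\u2155", "a", "1\u00b2", ""]
    for s in samples:
        names = ", ".join(unicodedata.name(ch) for ch in s) or "(empty string)"
        print(f"{names}: {classify(s)}")
```

Every decimal character is also a digit, and every digit is also numeric, but not the other way around. The empty string fails all three tests because of the "at least one character" requirement.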

I don't think we can refactor more than this without rewriting documentation for isnumeric which mentions the Unicode standard the same way.
History
Date                 User  Action  Args
2016-03-12 21:31:15  mdk   set     recipients: + mdk, terry.reedy, vstinner, ezio.melotti, docs@python, ethan.furman, martin.panter, serhiy.storchaka, Anna Koroliuk
2016-03-12 21:31:15  mdk   set     messageid: <1457818275.65.0.809023061424.issue26483@psf.upfronthosting.co.za>
2016-03-12 21:31:15  mdk   link    issue26483 messages
2016-03-12 21:31:15  mdk   create