Message 91225 - Python tracker

Author	ezio.melotti
Recipients	ezio.melotti
Date	2009-08-03.16:34:41
Message-id	<1249317285.35.0.709481915004.issue6632@psf.upfronthosting.co.za>
In-reply-to

Content

The decimal codec only handles characters in the Nd (Number, decimal)
Unicode category and whitespace [a]. It is used by int(), float(),
complex() and indirectly by Decimal(), Fraction() and possibly others.
It works for plain digits (e.g. int(u'１２３')) but not for the other
characters used to represent numbers, such as:
1. plus or minus sign, e.g. int(u'＋１２３') or int(u'－１２３')
2. decimal point, e.g. float(u'１．２３')
   2.1 some languages/alphabets use other chars (e.g. a comma or other
       symbols) instead of the decimal point.
3. exponential notation, e.g. float(u'１ｅ５')
4. the 'j' in complex numbers, e.g. complex(u'３ｊ')
5. the 'x' and 'p' in hexadecimal floats, e.g.
   float.fromhex(u'０ｘ１．７ｐ３')
   5.1 hex floats also use hexadecimal digits, see 6.3
6. digits > 9 for numbers with a base > 10, e.g. int(u'７Ｆ', 16)
    6.1 not all alphabets have equivalents of the letters a-z
    6.2 AFAIK there are no standards that specify how to deal with
        digits > 9
    6.3 in the Unicode FAQ [b] there's a link to a table [c] that says
        "Code points not listed in this file are not hexadecimal
        digits." This is not a standard though, and although the UCD [d]
        has a file [e] that defines the characters with the Hex_Digit
        property, it doesn't say that *only* these characters are valid
        hex digits. It also doesn't say anything about other bases.
        Python currently accepts int(u'１０', 16), int(u'७', 16)
        (U+096D - DEVANAGARI DIGIT SEVEN) and even int(u'７F', 16)
        (this works with an ASCII F but fails with a fullwidth Ｆ).
    6.4 UTS #18 [f] includes in the property 'xdigit' [g] (hexadecimal
        digit) all the chars defined in [c] and also all the chars in
        the Nd category. This is not a standard either, and it gives no
        indication of which hex digits are valid or how int() should
        behave.
    6.5 if possible re and int() should agree. Any string that matches
        /^[[:xdigit:]]+$/ should work fine with int(s, 16) and vice 
        versa. See also #6561 [h] and #2636 [i].
7. possibly others
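
The cases in points 1-6 can be checked with a short script. This is only
a sketch: it assumes a current CPython 3 interpreter (where str is always
Unicode and the u prefix is optional), and the helper name parses() and
the CASES table are made up for illustration. On such an interpreter the
plain-digit case and the mixed '７F' case are expected to parse, and the
others to raise ValueError.

```python
import unicodedata

def parses(func, *args):
    """Return True if func(*args) succeeds, False if it raises ValueError."""
    try:
        func(*args)
    except ValueError:
        return False
    return True

# (label, callable, arguments) for the cases in points 1-6 (hypothetical table).
CASES = [
    ("plain digits", int, ('１２３',)),
    ("plus sign", int, ('＋１２３',)),
    ("decimal point", float, ('１．２３',)),
    ("exponent", float, ('１ｅ５',)),
    ("imaginary unit", complex, ('３ｊ',)),
    ("fullwidth 7, ASCII F", int, ('７F', 16)),
    ("fullwidth 7, fullwidth F", int, ('７Ｆ', 16)),
]

for label, func, args in CASES:
    # Show the general category of each character next to the outcome.
    cats = [unicodedata.category(c) for c in args[0]]
    print(f"{label:26} parses={parses(func, *args)} categories={cats}")
```

The category column makes the pattern visible: only characters reported
as Nd are converted, everything else has to be plain ASCII.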


For the chars listed in points 1-5 there's no way, AFAIK, to know
their equivalents in other alphabets (if they exist at all), and since
there's (apparently) no standard that specifies how to handle them,
they should be kept out.
This will also avoid a number of problems, e.g. 2.1.

The fullwidth forms are an exception though: they seem to be the only
set of characters with a direct equivalent for all these chars, and they
are also the only non-ASCII chars included in the list of chars with the
Unicode Hex_Digit property.

Including all the necessary chars from this range in the decimal codec
seems to me the best thing to do. The fullwidth equivalents of the chars
listed in points 1-5 should all be supported and should work everywhere.
The regex used by Decimal/Fraction should be updated as well, since the
decimal codec is not accessible from Python (maybe it should be
accessible, but this is another issue).
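
To illustrate what accepting the fullwidth forms would mean in practice,
here is a minimal pure-Python sketch. It is not the decimal codec and not
a proposed patch: the names FULLWIDTH_TO_ASCII and fold_fullwidth() are
invented for this example. It relies only on the fact that the fullwidth
forms U+FF01..U+FF5E mirror ASCII U+0021..U+007E at a fixed offset of
0xFEE0.

```python
# Map each fullwidth form to its ASCII counterpart (fixed offset 0xFEE0).
FULLWIDTH_TO_ASCII = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}

def fold_fullwidth(s):
    """Replace fullwidth forms with ASCII; leave all other characters alone."""
    return s.translate(FULLWIDTH_TO_ASCII)

# After folding, the examples from points 1-6 go through the normal parsers.
print(int(fold_fullwidth('－１２３')))
print(float(fold_fullwidth('１．２３')))
print(float(fold_fullwidth('１ｅ５')))
print(complex(fold_fullwidth('３ｊ')))
print(float.fromhex(fold_fullwidth('０ｘ１．７ｐ３')))
print(int(fold_fullwidth('７Ｆ'), 16))
```

unicodedata.normalize('NFKC', s) would fold these characters too, but it
also rewrites many unrelated compatibility characters, so an explicit
table stays closer to the narrow change suggested here.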

Point 6 is a slightly different issue, even if it can be partially
solved by including the fullwidth forms. One possible option is to
limit the valid chars used by int() with bases > 10 to the characters
listed in [c], but this would be neither backward-compatible with
existing code nor forward-compatible with [[:xdigit:]].
OTOH if we keep the current behavior it will be possible to express the
digits from 0 to 9 using several alphabets, but all the digits > 9 will
be limited to [a-fA-F] (and possibly [ａ－ｆＡ－Ｆ]).
For example, '7F' in the Devanagari alphabet will result in a mix of
Devanagari digits and ASCII letters, i.e. int(u'७F', 16) (this already
works in Python).
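
The two candidate definitions of "hex digit" can be compared side by
side. The sketch below is an assumption-laden illustration, not existing
Python API: is_strict_hex() accepts only the characters listed in [c],
while is_xdigit() follows the xdigit row of [g] (anything in [c] plus
anything in the Nd category). Both function names and the HEX_DIGIT set
are hypothetical.

```python
import unicodedata

# Characters listed in [c]: ASCII 0-9, a-f, A-F and their fullwidth forms.
ASCII_HEX = '0123456789abcdefABCDEF'
HEX_DIGIT = set(ASCII_HEX) | {chr(ord(c) + 0xFEE0) for c in ASCII_HEX}

def is_strict_hex(s):
    """Option 1: accept only the characters listed in [c]."""
    return bool(s) and all(c in HEX_DIGIT for c in s)

def is_xdigit(s):
    """xdigit as described in [g]: characters from [c] or any Nd character."""
    return bool(s) and all(
        c in HEX_DIGIT or unicodedata.category(c) == 'Nd' for c in s)

for s in ('7F', '७F', '７Ｆ'):
    try:
        value = int(s, 16)
    except ValueError:
        value = None  # int() rejected the string
    print(ascii(s), is_strict_hex(s), is_xdigit(s), value)
```

Neither predicate is expected to line up with int(s, 16) for all three
strings on current CPython 3, which is the mismatch point 6.5 is about.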


[a]: http://svn.python.org/view/python/trunk/Objects/unicodeobject.c?view=markup
     under 'Decimal Encoder'
[b]: http://unicode.org/faq/casemap_charprop.html#13
[c]: http://unicode.org/faq/hex-digit-values.txt - [0-9a-fA-F０－９ａ－ｆＡ－Ｆ]
[d]: http://unicode.org/Public/UNIDATA/UCD.html#UCD_Files - PropList.txt section
[e]: http://unicode.org/Public/UNIDATA/PropList.txt
[f]: http://unicode.org/reports/tr18/ - UTS #18: Unicode Regular Expressions
[g]: http://unicode.org/reports/tr18/#Compatibility_Properties - xdigit row
[h]: http://bugs.python.org/issue6561#msg90878 - point (1) about int() and re
[i]: http://bugs.python.org/issue2636#msg65513 - point 8) will introduce [[:xdigit:]]

(Thanks to Mark Dickinson and Adam Olsen for pointing out some of these
issues.)

History
Date	User	Action	Args
2009-08-03 16:34:45	ezio.melotti	set	recipients: + ezio.melotti
2009-08-03 16:34:45	ezio.melotti	set	messageid: <1249317285.35.0.709481915004.issue6632@psf.upfronthosting.co.za>
2009-08-03 16:34:44	ezio.melotti	link	issue6632 messages
2009-08-03 16:34:42	ezio.melotti	create