msg336451 - (view) |
Author: Marcos Dione (StyXman) * |
Date: 2019-02-24 09:02 |
Following https://blog.lerner.co.il/pythons-str-isdigit-vs-str-isnumeric/, we have this:
Python 3.8.0a1+ (heads/master:001fee14e0, Feb 20 2019, 08:28:02)
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> '一二三四五'.isnumeric()
True
>>> int('一二三四五')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '一二三四五'
>>> float('一二三四五')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: '一二三四五'
I think Reuven is right: these should be accepted as input. I just wonder whether we should do the same for, for instance, Roman numerals...
|
msg336453 - (view) |
Author: Steven D'Aprano (steven.daprano) * |
Date: 2019-02-24 10:05 |
I think that analysis is wrong. The Wikipedia page describes the meaning of the Unicode Decimal/Digit/Numeric properties:
https://en.wikipedia.org/wiki/Unicode_character_property#Numeric_values_and_types
and the characters you show aren't appropriate for converting to ints:
py> import unicodedata
py> for c in '一二三四五':
... print(unicodedata.name(c))
...
CJK UNIFIED IDEOGRAPH-4E00
CJK UNIFIED IDEOGRAPH-4E8C
CJK UNIFIED IDEOGRAPH-4E09
CJK UNIFIED IDEOGRAPH-56DB
CJK UNIFIED IDEOGRAPH-4E94
The first one, for example, is translated as "one; a, an; alone"; it is better read as the *word* one rather than the numeral 1. (Disclaimer: I am not a Chinese speaker and I welcome correction from an expert.)
Likewise U+4E8C, translated as "two; twice".
The blog post is factually wrong when it claims:
"str.isdigit only returns True for what I said before, strings containing solely the digits 0-9."
py> s = "\N{BENGALI DIGIT ONE}\N{BENGALI DIGIT TWO}"
py> s.isdigit()
True
py> int(s)
12
So I think that there's nothing to do here (unless it is perhaps to add a FAQ about it, or improve the docs).
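The Decimal/Digit/Numeric properties referred to above can be inspected directly with the standard `unicodedata` module. The following is only a minimal sketch: the helper name `numeric_properties` and the choice of sample characters are made up for illustration and do not come from the blog post or the docs.

```python
import unicodedata


def numeric_properties(ch):
    """Return the (decimal, digit, numeric) values of one character.

    Each entry is None when the character has no value for that property.
    """
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


# Illustrative samples: a decimal digit, a digit that is not decimal,
# a fraction, and the first ideograph from the report.
samples = [
    "\N{BENGALI DIGIT ONE}",
    "\N{SUPERSCRIPT TWO}",
    "\N{VULGAR FRACTION THREE FIFTHS}",
    "\u4e00",
]
for ch in samples:
    print(f"U+{ord(ch):04X}", numeric_properties(ch))
```

Of these samples, only the Bengali digit has a decimal value. SUPERSCRIPT TWO has a digit value but no decimal value, and the fraction and the ideograph have only a numeric value.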
|
msg336454 - (view) |
Author: Mark Dickinson (mark.dickinson) * |
Date: 2019-02-24 10:14 |
[Steven posted his answer while I was composing mine; posting mine anyway ...]
I don't think this would make sense. There are lots of characters that can't be interpreted as a decimal digit but for which `isnumeric` nevertheless gives True.
>>> s = "㉓⅗⒘Ⅻ"
>>> import unicodedata
>>> for c in s: print(unicodedata.name(c))
...
CIRCLED NUMBER TWENTY THREE
VULGAR FRACTION THREE FIFTHS
NUMBER SEVENTEEN FULL STOP
ROMAN NUMERAL TWELVE
>>> s.isnumeric()
True
What value would you expect `int(s)` to have in this situation?
Note that `int` and `float` already accept non-ASCII digits:
>>> s = "١٢٣٤٥٦٧٨٩"
>>> int(s)
123456789
>>> float(s)
123456789.0
|
msg336455 - (view) |
Author: Mark Dickinson (mark.dickinson) * |
Date: 2019-02-24 10:24 |
> What value would you expect `int(s)` to have in this situation?
Actually, I guess that question was too easy. The value for `int(s)` should *obviously* be 23 * 1000 + (3/5) * 100 + 17 * 10 + 12 = 23242. I should have used ⅐ instead of ⅗.
Anyway, agreed with Steven that no change should be made here.
|
msg336456 - (view) |
Author: Karthikeyan Singaravelan (xtreak) * |
Date: 2019-02-24 10:31 |
I'm not a Unicode expert, but while searching along these lines I found a note, added in issue10610, saying that int() is supported for characters of the 'Nd' category. So to check if a string can be converted to integer with help of int() I should be using str.isdecimal() instead of str.isnumeric() ?
https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex
> The numeric literals accepted include the digits 0 to 9 or any Unicode equivalent (code points with the Nd property). See http://www.unicode.org/Public/10.0.0/ucd/extracted/DerivedNumericType.txt for a complete list of code points with the Nd property.
>>> import unicodedata
>>> [unicodedata.category(c) for c in '一二三四五']
['Lo', 'Lo', 'Lo', 'Lo', 'Lo']
>>> [unicodedata.category(c) for c in '\N{BENGALI DIGIT ONE}\N{BENGALI DIGIT TWO}']
['Nd', 'Nd']
|
msg336459 - (view) |
Author: Mark Dickinson (mark.dickinson) * |
Date: 2019-02-24 10:57 |
> So to check if a string can be converted to integer with help of int() I should be using str.isdecimal() instead of str.isnumeric() ?
Yes, I think that's correct. The characters matched by `str.isdecimal` are a subset of those matched by `str.isdigit`, which in turn are a subset of those matched by `str.isnumeric`. `int` and `float` required general category Nd, which corresponds to `str.isdigit`.
|
msg336460 - (view) |
Author: Mark Dickinson (mark.dickinson) * |
Date: 2019-02-24 10:58 |
> which corresponds to `str.isdigit`.
Gah! That should have said:
> which corresponds to `str.isdecimal`.
Sorry.
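The subset relationship, and the link between `str.isdecimal` and what `int` accepts, can be spot-checked on a few characters. This is an illustrative sketch only: the `accepted_by_int` helper and the sample characters are assumptions made for the example, and a handful of samples does not prove the general claim.

```python
def accepted_by_int(ch):
    """Return True if int() can parse the single character ch."""
    try:
        int(ch)
    except ValueError:
        return False
    return True


# Hypothetical sample set covering each level of the hierarchy.
samples = {
    "ASCII digit": "7",
    "Arabic-Indic digit": "\N{ARABIC-INDIC DIGIT SEVEN}",
    "superscript two": "\N{SUPERSCRIPT TWO}",
    "vulgar fraction": "\N{VULGAR FRACTION THREE FIFTHS}",
    "CJK ideograph": "\u4e00",
}

for label, ch in samples.items():
    flags = (ch.isdecimal(), ch.isdigit(), ch.isnumeric())
    # isdecimal implies isdigit, which in turn implies isnumeric.
    assert flags in {
        (True, True, True),
        (False, True, True),
        (False, False, True),
        (False, False, False),
    }
    # For these single-character samples, int() succeeds exactly
    # when isdecimal() is True.
    assert accepted_by_int(ch) == ch.isdecimal()
    print(f"{label:20} {flags} int ok: {accepted_by_int(ch)}")
```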
|
msg336461 - (view) |
Author: Karthikeyan Singaravelan (xtreak) * |
Date: 2019-02-24 11:07 |
> `int` and `float` required general category Nd, which corresponds to `str.isdigit`.
Sorry, did you mean str.isdecimal? There are characters for which isdigit returns True but isdecimal returns False:
>>> '\u00B2'.isdecimal()
False
>>> '\u00B2'.isdigit()
True
>>> import unicodedata
>>> unicodedata.category('\u00B2')
'No'
>>> int('\u00B2')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '²'
Is this worth an FAQ or an addition to the existing note on int that specifies characters should belong to 'Nd' category to add a note that str.isdecimal should return True
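One caveat applies to whole strings rather than single characters: `int()` also accepts surrounding whitespace, a leading sign, and (since Python 3.6) underscores between digits, and strings containing any of these fail `str.isdecimal`. A hedged sketch of the usual alternative, which attempts the conversion and catches `ValueError` (the name `parse_int` is hypothetical):

```python
def parse_int(text, default=None):
    """Try int(text); return default when int() rejects the string."""
    try:
        return int(text)
    except ValueError:
        return default


# Compare the isdecimal() pre-check with what int() actually accepts.
for text in ["42", " 42 ", "-42", "1_000", "\N{SUPERSCRIPT TWO}", ""]:
    print(ascii(text), text.isdecimal(), parse_int(text))
```

Whether to pre-check with `str.isdecimal` or simply try the conversion depends on whether signs, whitespace and underscores should count as valid input.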
|
msg336462 - (view) |
Author: Steven D'Aprano (steven.daprano) * |
Date: 2019-02-24 11:13 |
On Sun, Feb 24, 2019 at 11:07:41AM +0000, Karthikeyan Singaravelan wrote:
> Is this worth an FAQ or an addition to the existing note on int that
> specifies characters should belong to 'Nd' category to add a note that
> str.isdecimal should return True
Yes, I think that there should be a FAQ about the differences between
isdigit, isdecimal and isnumeric, pointing to the relevant Unicode
documentation. I would also like to see a briefer note added to each of
the string methods docstrings as well.
|
msg336464 - (view) |
Author: Karthikeyan Singaravelan (xtreak) * |
Date: 2019-02-24 11:44 |
Agreed. Although the behavior of str.isnumeric is arguably correct for a user who knows Unicode internals, its name makes it easy for a general user to reach for it when trying to determine whether a string can be passed to int(). I am not sure how this can be explained in simpler terms, but it would be good to clarify it in the docs to avoid confusion.
There seems to have been a thread [0] in the past about how having multiple ways to check whether a Unicode string is a number causes confusion. It is even more confusing on Python 2, where strings are not Unicode by default.
$ python2.7
Python 2.7.14 (default, Mar 12 2018, 13:54:56)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '\u00B2'.isdigit()
False
>>> u'\u00B2'.isdigit()
True
[0] https://mail.python.org/pipermail/python-list/2012-May/624340.html
|
msg336466 - (view) |
Author: Marcos Dione (StyXman) * |
Date: 2019-02-24 12:39 |
Thanks for all the examples, I'm convinced.
|
msg336467 - (view) |
Author: Steven D'Aprano (steven.daprano) * |
Date: 2019-02-24 13:32 |
I'm re-opening the ticket with a change of subject, because I think this should be treated as a documentation enhancement:
- improve the docstrings for str.isdigit, isnumeric and isdecimal to make it clear what each does (e.g. what counts as a digit);
- similarly improve the documentation for int and float? although the existing comment may be sufficient
https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex
- add a FAQ summarizing the situation.
I don't think we need to worry about backporting the docs to Python 2, but if others disagree, I won't object.
|
Date | User | Action | Args
2022-04-11 14:59:11 | admin | set | github: 80281
2019-02-24 13:32:58 | steven.daprano | set | status: closed -> open; resolution: not a bug -> ; assignee: docs@python; stage: resolved -> ; title: int() and float() should accept any isnumeric() digit -> Document the differences between str.isdigit, isdecimal and isnumeric; nosy: + docs@python; versions: - Python 2.7, Python 3.4, Python 3.5, Python 3.6, Python 3.7, Python 3.8; messages: + msg336467; components: + Documentation, - Library (Lib); type: behavior -> enhancement
2019-02-24 12:39:23 | StyXman | set | status: open -> closed; versions: + Python 3.4, Python 3.5, Python 3.6; messages: + msg336466; resolution: not a bug; stage: resolved
2019-02-24 11:44:36 | xtreak | set | messages: + msg336464; versions: - Python 3.4, Python 3.5, Python 3.6
2019-02-24 11:13:40 | steven.daprano | set | messages: + msg336462
2019-02-24 11:07:41 | xtreak | set | messages: + msg336461
2019-02-24 10:58:35 | mark.dickinson | set | messages: + msg336460
2019-02-24 10:57:40 | mark.dickinson | set | messages: + msg336459
2019-02-24 10:31:50 | xtreak | set | nosy: + xtreak; messages: + msg336456
2019-02-24 10:24:17 | mark.dickinson | set | messages: + msg336455
2019-02-24 10:14:50 | mark.dickinson | set | nosy: + mark.dickinson; messages: + msg336454
2019-02-24 10:05:25 | steven.daprano | set | nosy: + steven.daprano; messages: + msg336453
2019-02-24 09:02:22 | StyXman | create |