Review and document string format accepted in numeric data type constructors #54790
I am opening a new report to continue work on the issues raised in bpo-10557 that are either feature requests or documentation bugs. The rest is my reply to the relevant portions of Marc's comment at msg122785. On Mon, Nov 29, 2010 at 4:41 AM, Marc-Andre Lemburg <report@bugs.python.org> wrote:
In order to disallow a mix of say Arabic-Indic and Bengali digits. Such combinations cannot be defended as possibly valid numbers in any script.
This is a valid question. I may be in the minority, but I find it convenient to use int(), float() etc. for data validation. If my program gets a CSV file with Arabic-Indic digits, I want to fire the guy who prepared it before it causes real issues. :-) I may be too strict, but I don't think anyone would want to see columns with a mix of Bengali and Devanagari numerals. On the other hand, there is a certain convenience in promiscuous parsers, but this is not an expectation that I have of int() and friends. int('0xFF') requires me to specify the base even though 0xFF is a perfectly valid notation. There are pros and cons in any approach. Let's figure out what is better before we fix the documentation.
In my view int() and friends are only marginally related to Unicode and Unicode methods design is not directly relevant to their behavior. If we were designing str.todigits(), by all means, I would argue that it must be consistent with str.isdigit(). For numeric data, however, I think we should follow the logic that rejected int('0xFF'). This is my opinion. We can consider allowing int('0xFF') as well. Let's discuss. |
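For readers following the int('0xFF') argument, here is a small illustration of the behavior being referred to. It only uses the built-in int(); the variable names are made up for this example.

```python
# With the default base 10, a prefixed literal is rejected.
try:
    int("0xFF")
except ValueError as exc:
    default_base_error = str(exc)

# The prefix is accepted when it matches an explicit base...
explicit_base = int("0xFF", 16)

# ...or when base 0 asks int() to infer the base from the prefix.
inferred_base = int("0xFF", 0)

print(default_base_error)
print(explicit_base, inferred_base)
```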
See also bpo-9574 for a somewhat related discussion. |
What do other languages that are able to convert Unicode numerals to integer objects do? |
On Sat, May 7, 2011 at 11:25 AM, Éric Araujo <report@bugs.python.org> wrote:
In order to flag use of mixed scripts in numerals, the code does not [...] Therefore, the validation code can simply check that for all digits in [...] |
Me too. This is why I'd still be happiest with int and float not accepting non-ASCII digits at all. (And also why the recent suggestions to allow extra underscores in int and float input make me uneasy.) |
I've changed my mind :-) Restricting the decimal encoder to only accept code points from one of the possible decimal digit ranges is a good idea. Let's do that. |
It looks like we are approaching consensus on some points:

1. Mixed script numerals should be disallowed.
2. '\N{MINUS SIGN}' should be accepted as an alternative to '\N{HYPHEN-MINUS}'.

Open question: should we accept fullwidth + and -, sub/superscript variants etc.? I believe rather than debating variant codepoints one by one, we should consider applying NFKC (compatibility) normalization to unicode strings to be interpreted as numbers. This would allow parsing strings like this:

    >>> float(normalize('NFKC', '\N{FULLWIDTH HYPHEN-MINUS}\N{DIGIT ONE FULL STOP}\N{FULLWIDTH DIGIT TWO}'))
    -1.2
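The NFKC suggestion can be tried with the standard unicodedata module. This is only a sketch of the idea, not a proposed implementation: `parse_float_nfkc` is a hypothetical helper name, and normalizing the whole string before parsing is the assumption under discussion.

```python
import unicodedata


def parse_float_nfkc(text):
    """Apply NFKC compatibility normalization, then defer to float()."""
    return float(unicodedata.normalize("NFKC", text))


sample = "\N{FULLWIDTH HYPHEN-MINUS}\N{DIGIT ONE FULL STOP}\N{FULLWIDTH DIGIT TWO}"

# NFKC folds the fullwidth sign, the "1." compatibility character and the
# fullwidth digit into plain ASCII before float() ever sees them.
normalized = unicodedata.normalize("NFKC", sample)
print(repr(normalized), parse_float_nfkc(sample))
```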
On 12.06.2013 07:32, Alexander Belopolsky wrote:
>
> Alexander Belopolsky added the comment:
>
> It looks like we are approaching consensus on some points:
>
> 1. Mixed script numerals should be disallowed.
> 2. '\N{MINUS SIGN}' should be accepted as an alternative to '\N{HYPHEN-MINUS}'
>
> Open question: should we accept fullwidth + and -, sub/superscript variants etc.? I believe rather than debating variant codepoints one by one, we should consider applying NFKC (compatibility) normalization to unicode strings to be interpreted as numbers. This would allow parsing strings like this:
>
> >>> float(normalize('NFKC', '\N{FULLWIDTH HYPHEN-MINUS}\N{DIGIT ONE FULL STOP}\N{FULLWIDTH DIGIT TWO}'))
> -1.2

While it would solve these cases, I think that would cause a [...]

Perhaps we could do this in two phases: [...]
I think PEP-393 gives us a quick way to fast parsing: if the max char is < 128, just roll straight into normal processing, otherwise do the normalisation and "all decimal digits are from the same script" steps.

There are almost certainly better ways to do the script translation, but the example below tries to just do the "convert to ASCII" step to avoid duplicating the +/- and decimal point processing logic:

    if max_char(arg) >= 128:
        arg = toNFKC(arg)
        originals = set()
        converted = []
        for c in arg:
            try:
                d = str(unicodedata.decimal(c))
            except ValueError:
                d = c
            else:
                # Track code points, not characters, so the range check works
                originals.add(ord(c))
            converted.append(d)
        if originals and (max(originals) - min(originals)) >= 10:
            raise ValueError("%s mixes digits from multiple scripts" % arg)
        arg = "".join(converted)
    result = parse_ascii_number(arg)

P.S. I don't think the base argument is especially applicable ('0x' is rejected because 'x' is not a base 10 digit and we allow a base of '0' to request "use int literal base markers").
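A self-contained variant of the sketch above, for anyone who wants to run it. The assumptions are mine: `to_ascii_number` is a hypothetical name, `str.isascii()` (Python 3.7+) stands in for the max-char check, and instead of the "range spans 10 or more code points" test it records the code point of each digit's zero and requires all digits to share it.

```python
import unicodedata


def to_ascii_number(text):
    """Translate decimal digits to ASCII, rejecting mixed-script numerals."""
    if text.isascii():
        return text  # fast path: nothing to translate
    text = unicodedata.normalize("NFKC", text)
    zero_points = set()
    converted = []
    for ch in text:
        value = unicodedata.decimal(ch, None)
        if value is None:
            converted.append(ch)  # sign, decimal point, etc. pass through
        else:
            # ord(ch) - value is the code point of this script's digit zero.
            zero_points.add(ord(ch) - value)
            converted.append(str(value))
    if len(zero_points) > 1:
        raise ValueError("%r mixes digits from multiple scripts" % text)
    return "".join(converted)


# Arabic-Indic digits one and two, preceded by an ASCII minus sign.
print(int(to_ascii_number("-\u0661\u0662")))
```

Note that under this check a mix of ASCII and non-ASCII digits is also rejected, since they have different zero code points.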
The PEP-393 implementation has already added the fast path to decimal encoding: http://hg.python.org/cpython/diff/8beaa9a37387/Objects/unicodeobject.c#l1.3735

What we can do, however, is improve performance of converting non-ASCII numerals by looking up only the first digit's value and converting the rest using a simple offset:

    value = code - (first_code - first_value)
    if not 0 <= value < 10:
        raise  # or fall back to UCD lookup
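The offset trick above can be sketched in pure Python as follows. This is an illustration only: `digits_to_int` is a hypothetical helper, it handles a bare run of digits with no sign or whitespace, and it takes the "raise" branch rather than falling back to a UCD lookup.

```python
import unicodedata


def digits_to_int(digits):
    """Convert a run of same-script decimal digits using one UCD lookup."""
    if not digits:
        raise ValueError("empty digit string")
    first_code = ord(digits[0])
    # Raises ValueError if the first character is not a decimal digit.
    first_value = unicodedata.decimal(digits[0])
    zero_code = first_code - first_value
    result = 0
    for ch in digits:
        value = ord(ch) - zero_code
        if not 0 <= value < 10:
            raise ValueError(
                "digit %r is not in the same range as %r" % (ch, digits[0])
            )
        result = result * 10 + value
    return result


print(digits_to_int("\u0661\u0662\u0663"))
```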
On 14.06.2013 03:43, Alexander Belopolsky wrote:
I'm not sure whether just relying on PEP-393 is good enough. Of course, you can special case the conversion based on the [...]

Slicing operations don't recheck the max code point [...]

Apart from the fast-path based on the string kind, [...]
I took another look at the library reference and it looks like when it comes to non-ASCII digits support, the reference contradicts itself. On one hand,

"""
If x is not a number or if base is given, then x must be a string, bytes, or bytearray instance representing an integer literal in radix base. Optionally, the literal can be preceded by + or - (with no space in between) and surrounded by whitespace. ..
"""

suggests that only "an integer literal" will be accepted by int(), but on the other hand, a note in the "Numeric Types" section says: "The numeric literals accepted include the digits 0 to 9 or any Unicode equivalent (code points with the Nd property)." <http://docs.python.org/3/library/stdtypes.html#typesnumeric>

It also appears that the "surrounded by whitespace" part is not entirely correct:

    >>> '\N{RS}'.isspace()
    True
    >>> int('123\N{RS}')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: invalid literal for int() with base 10: '123\x1e'

This is probably a bug in the current implementation and I will open a separate issue for that.
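To see how far the "surrounded by whitespace" wording diverges from the implementation on a given interpreter, one could scan every code point. This is a rough probe under my own assumptions: `whitespace_discrepancies` is a hypothetical name, and the result depends on the Python version being run, so no expected output is given.

```python
import sys


def whitespace_discrepancies():
    """List code points that str.isspace() accepts but int() does not strip."""
    rejected = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if not ch.isspace():
            continue
        try:
            int("123" + ch)
        except ValueError:
            rejected.append(cp)
    return rejected


for cp in whitespace_discrepancies():
    print("U+%04X" % cp)
```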
I opened bpo-18236 to address the issue of surrounding whitespace. |
I have started a rough prototype for what I plan to eventually reimplement in C and propose as a patch here. https://bitbucket.org/alexander_belopolsky/misc/src/c175171cc76e/utoi.py?at=master Comments welcome. |
Martin v. Löwis wrote at bpo-18236 (msg191687):
Py_ISSPACE matches the Unicode White_Space property in the ASCII range (the first 128 code points), but it differs for byte (code point) values from 128 through 255. This leads to the following discrepancy:

    >>> int('123\xa0')
    123

but

    >>> int(b'123\xa0')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 3: invalid start byte
    >>> int('123\xa0'.encode())
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: invalid literal for int() with base 10: '123\xa0'
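The str versus bytes difference can be checked without depending on the exact exception type or message, which may differ between versions. In this sketch `parses_as_int` is a hypothetical helper; it relies on UnicodeDecodeError being a subclass of ValueError, so either failure shown above is caught.

```python
def parses_as_int(value):
    """Return True if int() accepts the value, False if it raises ValueError."""
    try:
        int(value)
    except ValueError:  # also covers UnicodeDecodeError, a ValueError subclass
        return False
    return True


# As a str, a trailing NO-BREAK SPACE is treated as whitespace and stripped.
text_result = int("123\xa0")

# As bytes, neither the raw 0xA0 byte nor its UTF-8 encoding is accepted.
raw_bytes_ok = parses_as_int(b"123\xa0")
encoded_ok = parses_as_int("123\xa0".encode())

print(text_result, raw_bytes_ok, encoded_ok)
```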
For the last discrepancy see bpo-16741. It has a patch which should fix this. |