
Review and document string format accepted in numeric data type constructors #54790

Open
abalkin opened this issue Nov 29, 2010 · 16 comments
Labels: docs (Documentation in the Doc dir), interpreter-core (Objects, Python, Grammar, and Parser dirs), type-feature (A feature request or enhancement)


@abalkin
Member

abalkin commented Nov 29, 2010

BPO 10581
Nosy @malemburg, @loewis, @mdickinson, @ncoghlan, @abalkin, @vstinner, @ericvsmith, @ezio-melotti, @merwok, @serhiy-storchaka

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = 'https://github.com/abalkin'
closed_at = None
created_at = <Date 2010-11-29.17:55:50.898>
labels = ['interpreter-core', 'type-feature', 'docs']
title = 'Review and document string format accepted in numeric data type constructors'
updated_at = <Date 2014-10-14.15:12:17.129>
user = 'https://github.com/abalkin'

bugs.python.org fields:

activity = <Date 2014-10-14.15:12:17.129>
actor = 'skrah'
assignee = 'belopolsky'
closed = False
closed_date = None
closer = None
components = ['Documentation', 'Interpreter Core']
creation = <Date 2010-11-29.17:55:50.898>
creator = 'belopolsky'
dependencies = []
files = []
hgrepos = []
issue_num = 10581
keywords = []
message_count = 16.0
messages = ['122834', '122835', '135469', '135536', '135977', '190949', '191011', '191014', '191081', '191101', '191105', '191302', '191304', '191314', '191709', '191720']
nosy_count = 11.0
nosy_names = ['lemburg', 'loewis', 'mark.dickinson', 'ncoghlan', 'belopolsky', 'vstinner', 'eric.smith', 'ezio.melotti', 'eric.araujo', 'cvrebert', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue10581'
versions = ['Python 3.3', 'Python 3.4']

@abalkin
Member Author

abalkin commented Nov 29, 2010

I am opening a new report to continue work on the issues raised in bpo-10557 that are either feature requests or documentation bugs.

The rest is my reply to the relevant portions of Marc's comment at msg122785.

On Mon, Nov 29, 2010 at 4:41 AM, Marc-Andre Lemburg <report@bugs.python.org> wrote:
..

Alexander Belopolsky wrote:
>
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
>
> After a bit of svn archeology, it does appear that Arabic-Indic
> digits' support was deliberate at least in the sense that the
> feature was tested for when the code was first committed. See r15000.

[MAL] As I mentioned on python-dev (http://mail.python.org/pipermail/python-dev/2010-November/106077.html), this support was added intentionally.

> The test migrated from file to file over the last 10 years, but it
> is still present in test_float.py:
>
>         self.assertEqual(float(b"  \u0663.\u0661\u0664  ".decode('raw-unicode-escape')), 3.14)
>
> (It should probably now be rewritten using a string literal.)
>
..
> For the future, I note that starting with Unicode 6.0.0,
> the Unicode Consortium promises that
>
> """
> Characters with the property value Numeric_Type=de (Decimal) only
> occur in contiguous ranges of 10 characters, with ascending numeric
> values from 0 to 9 (Numeric_Value=0..9).
> """
>
> This makes it very easy to check a numeric string does not contain
> a mix of digits from different scripts.

[MAL] I'm not sure why you'd want to check for such ranges.

In order to disallow a mix of say Arabic-Indic and Bengali digits. Such combinations cannot be defended as possibly valid numbers in any script.

> I still believe that proper API should require explicit choice of
> language or locale before allowing digits other than 0-9 just as
> int() would not accept hexadecimal digits without explicit choice of
> base >= 16.  But this would be a subject of a feature request.

[MAL] Since when do we require a locale or language to be specified when using Unicode?

This is a valid question. I may be in minority, but I find it convenient to use int(), float() etc. for data validation. If my program gets a CSV file with Arabic-Indic digits, I want to fire the guy who prepared it before it causes real issues. :-) I may be too strict, but I don't think anyone would want to see columns with a mix of Bengali and Devanagari numerals.

On the other hand there is certain convenience in promiscuous parsers, but this is not an expectation that I have from int() and friends. int('0xFF') requires me to specify base even though 0xFF is a perfectly valid notation.
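
For concreteness, this is today's behavior; passing base 0 is what opts in to the int-literal prefixes:

>>> int('0xFF')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '0xFF'
>>> int('0xFF', 16)
255
>>> int('0xFF', 0)
255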

There are pros and cons in any approach. Let's figure out what is better before we fix the documentation.

[MAL] The codecs, Unicode methods and other Unicode support features happily work with all kinds of languages, mixed or not, without any such specification.

In my view int() and friends are only marginally related to Unicode and Unicode methods design is not directly relevant to their behavior. If we were designing str.todigits(), by all means, I would argue that it must be consistent with str.isdigit(). For numeric data, however, I think we should follow the logic that rejected int('0xFF').

This is my opinion. We can consider allowing int('0xFF') as well. Let's discuss.

@abalkin abalkin self-assigned this Nov 29, 2010
@abalkin abalkin added docs Documentation in the Doc dir interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Nov 29, 2010
@abalkin
Member Author

abalkin commented Nov 29, 2010

See also bpo-9574 for a somewhat related discussion.

@merwok
Member

merwok commented May 7, 2011

> I may be in minority, but I find it convenient to use int(), float()
> etc. for data validation.

A number of libraries agree: argparse, HTML form handling libs, etc.

> I may be too strict, but I don't think anyone would want to see
> columns with a mix of Bengali and Devanagari numerals. [...]
> On the other hand there is certain convenience in promiscuous
> parsers, but this is not an expectation that I have from int() and
> friends. [...] There are pros and cons in any approach.

Indeed, a tough question. On one hand, I tend to agree that mixing Hindi/Arabic numerals with Bengali does not make sense; on the other hand, rejecting it means that the int code does know about Unicode, which you argued against.

[MAL]
> The codecs, Unicode methods and other Unicode support features
> happily work with all kinds of languages, mixed or not, without any
> such specification.
[Alexander]
> In my view int() and friends are only marginally related to Unicode
> and Unicode methods design is not directly relevant to their behavior.

I think I agree. It’s perfectly fine that Unicode support features don’t care about the type of the characters but just encode and decode; however, int has a validation step. It rejects numerals that don’t make sense with the given base, for example, so rejecting nonsensical sequences of Unicode numerals makes sense IMO.

What do the other languages that are able to convert from Unicode numerals to integer objects do?

@abalkin
Member Author

abalkin commented May 8, 2011

On Sat, May 7, 2011 at 11:25 AM, Éric Araujo <report@bugs.python.org> wrote:

> .. On one hand, I tend to agree that mixing Hindi/Arabic numerals with
> Bengali does not make sense; on the other hand, rejecting it means that
> the int code does know about Unicode, which you argued against.

In order to flag use of mixed scripts in numerals, the code does not
require access to any additional Unicode data. Since Unicode 6.0.0,
programmers can rely on the following stability promise:

"""
Characters with the property value Numeric_Type=de (Decimal) only
occur in contiguous ranges of 10 characters, with ascending numeric
values from 0 to 9 (Numeric_Value=0..9).
""" -- http://www.unicode.org/policies/stability_policy.html

Therefore, the validation code can simply check that for all digits in
the number, ord(d) - unicodedata.numeric(d) is the same.
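
A minimal sketch of that check (digits_from_one_script is a hypothetical helper; unicodedata.decimal() is the integer-valued digit lookup):

    import unicodedata

    def digits_from_one_script(s):
        # For digits drawn from a single script, ord(d) - decimal_value(d)
        # is a constant, because each script's decimal digits occupy one
        # contiguous run of 10 code points.
        offsets = {ord(d) - unicodedata.decimal(d)
                   for d in s
                   if unicodedata.category(d) == 'Nd'}
        return len(offsets) <= 1

For example, digits_from_one_script('123') is True, while digits_from_one_script('12\u0663') (ASCII digits mixed with ARABIC-INDIC DIGIT THREE) is False.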

@mdickinson
Member

> I find it convenient to use int(), float() etc. for data validation.

Me too. This is why I'd still be happiest with int and float not accepting non-ASCII digits at all. (And also why the recent suggestions to allow extra underscores in int and float input make me uneasy.)

@malemburg
Member

I've changed my mind :-)

Restricting the decimal encoder to only accept code points from one of the possible decimal digit ranges is a good idea. Let's do that.

@abalkin
Member Author

abalkin commented Jun 12, 2013

It looks like we are approaching consensus on some points:

  1. Mixed script numerals should be disallowed.
  2. '\N{MINUS SIGN}' should be accepted as an alternative to '\N{HYPHEN-MINUS}'

Open question: should we accept fullwidth + and -, sub/superscript variants, etc.? I believe that rather than debating variant code points one by one, we should consider applying NFKC (compatibility) normalization to Unicode strings to be interpreted as numbers. This would allow parsing strings like this:

>>> from unicodedata import normalize
>>> float(normalize('NFKC', '\N{FULLWIDTH HYPHEN-MINUS}\N{DIGIT ONE FULL STOP}\N{FULLWIDTH DIGIT TWO}'))
-1.2

@malemburg
Member

On 12.06.2013 07:32, Alexander Belopolsky wrote:
> 
> Alexander Belopolsky added the comment:
> 
> It looks like we are approaching consensus on some points:
> 
> 1. Mixed script numerals should be disallowed.
> 2. '\N{MINUS SIGN}' should be accepted as an alternative to '\N{HYPHEN-MINUS}'
> 
> Open question: should we accept fullwidth + and -, sub/superscript variants, etc.?  I believe that rather than debating variant code points one by one, we should consider applying NFKC (compatibility) normalization to Unicode strings to be interpreted as numbers.  This would allow parsing strings like this:
> 
> >>> float(normalize('NFKC', '\N{FULLWIDTH HYPHEN-MINUS}\N{DIGIT ONE FULL STOP}\N{FULLWIDTH DIGIT TWO}'))
> -1.2

While it would solve these cases, I think that would cause a
significant performance hit.

Perhaps we could do this in two phases:

  1. detect whether the string uses non-ASCII digits and symbols
  2. if it does, apply normalization and then use the decimal codec
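
In Python terms, the two phases might look like this (a sketch only, with a hypothetical helper name; the real work would live in the C decimal codec):

    from unicodedata import normalize

    def prepare_numeric_string(s):
        # Phase 1: cheap scan for non-ASCII code points.
        if all(ord(c) < 128 for c in s):
            return s
        # Phase 2: NFKC-normalize, then hand off to the decimal codec.
        return normalize('NFKC', s)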

@ncoghlan
Contributor

I think PEP 393 gives us a quick way to keep parsing fast: if the max char is < 128, just roll straight into normal processing; otherwise do the normalisation and "all decimal digits are from the same script" steps.

There are almost certainly better ways to do the script translation, but the example below tries to just do the "convert to ASCII" step to avoid duplicating the +/- and decimal point processing logic:

    import unicodedata

    def max_char(s):
        # Stand-in for the PEP 393 "max char" query on the string.
        return max(map(ord, s), default=0)

    if max_char(arg) >= 128:
        arg = unicodedata.normalize('NFKC', arg)
        originals = set()
        converted = []
        for c in arg:
            try:
                d = str(unicodedata.decimal(c))
            except ValueError:
                d = c               # not a decimal digit: keep sign, point, etc.
            else:
                originals.add(c)
            converted.append(d)
        # Digits of one script span at most 10 consecutive code points.
        if originals and ord(max(originals)) - ord(min(originals)) >= 10:
            raise ValueError("%s mixes digits from multiple scripts" % arg)
        arg = "".join(converted)
    result = parse_ascii_number(arg)    # placeholder for the ASCII-only parser

P.S. I don't think the base argument is especially applicable ('0x' is rejected because 'x' is not a base 10 digit and we allow a base of '0' to request "use int literal base markers").

@abalkin
Member Author

abalkin commented Jun 14, 2013

The PEP 393 implementation has already added a fast path to decimal encoding:

http://hg.python.org/cpython/diff/8beaa9a37387/Objects/unicodeobject.c#l1.3735

What we can do, however, is improve the performance of converting non-ASCII numerals by looking up only the first digit's value and converting the rest with simple arithmetic:

value = code - (first_code - first_value)
if not 0 <= value < 10:
    raise or fall back to UCD lookup
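
Spelled out as a runnable sketch (using unicodedata.decimal() for the UCD lookup, and falling back rather than raising):

    import unicodedata

    def fast_digit_values(s):
        # Look up only the first digit in the UCD; a script's digits are
        # contiguous, so every other digit converts with one subtraction.
        offset = ord(s[0]) - unicodedata.decimal(s[0])
        values = []
        for c in s:
            value = ord(c) - offset
            if not 0 <= value < 10:
                # Full UCD lookup as the fallback; this raises ValueError
                # for code points that are not decimal digits.
                value = unicodedata.decimal(c)
            values.append(value)
        return values

For example, fast_digit_values('\u0663\u0661\u0664') returns [3, 1, 4] after a single UCD lookup.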

@malemburg
Member

On 14.06.2013 03:43, Alexander Belopolsky wrote:

> Alexander Belopolsky added the comment:
>
> The PEP 393 implementation has already added a fast path to decimal encoding:
>
> http://hg.python.org/cpython/diff/8beaa9a37387/Objects/unicodeobject.c#l1.3735
>
> What we can do, however, is improve the performance of converting non-ASCII numerals by looking up only the first digit's value and converting the rest with simple arithmetic:
>
> value = code - (first_code - first_value)
> if not 0 <= value < 10:
>     raise or fall back to UCD lookup

I'm not sure whether just relying on PEP-393 is good enough.

Of course, you can special case the conversion based on the
kind, but that's only one form of optimization.

Slicing operations don't recheck the max code point
used in the substring. As a result, a slice may very well
be of the UCS2 kind, even though the text itself is ASCII.

Apart from the fast-path based on the string kind,
I think the decimal encoder would also have to scan the
string for non-ASCII code points. If it finds non-ASCII
code points, it would have to call the normalizer and
restart the scan based on the normalized string.


@abalkin
Member Author

abalkin commented Jun 17, 2013

I took another look at the library reference, and it looks like when it comes to non-ASCII digit support, the reference contradicts itself. On one hand,

"""
int(x, base=10)

If x is not a number or if base is given, then x must be a string, bytes, or bytearray instance representing an integer literal in radix base. Optionally, the literal can be preceded by + or - (with no space in between) and surrounded by whitespace.
""" <http://docs.python.org/3/library/functions.html#int\>

.. suggests that only "an integer literal" will be accepted by int(), but on the other hand, a note in the "Numeric Types" section says: "The numeric literals accepted include the digits 0 to 9 or any Unicode equivalent (code points with the Nd property)." <http://docs.python.org/3/library/stdtypes.html#typesnumeric\>
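
The second note matches what the implementation actually does; int() happily accepts any Nd digits:

>>> int('\u0661\u0662\u0663')   # ARABIC-INDIC digits one, two, three
123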

It also appears that the "surrounded by whitespace" part is not entirely correct:

>>> '\N{RS}'.isspace()
True
>>> int('123\N{RS}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '123\x1e'

This is probably a bug in the current implementation and I will open a separate issue for that.

@abalkin
Member Author

abalkin commented Jun 17, 2013

I opened bpo-18236 to address the issue of surrounding whitespace.

@abalkin
Member Author

abalkin commented Jun 17, 2013

I have started a rough prototype for what I plan to eventually reimplement in C and propose as a patch here.

https://bitbucket.org/alexander_belopolsky/misc/src/c175171cc76e/utoi.py?at=master

Comments welcome.

@abalkin
Member Author

abalkin commented Jun 23, 2013

Martin v. Löwis wrote at bpo-18236 (msg191687):

> int conversion ultimately uses Py_ISSPACE, which conceptually could
> deviate from the Unicode properties (as it is byte-based). This is not
> really an issue, since they indeed match.

Py_ISSPACE matches the Unicode White_Space property in the ASCII range (the first 128 code points), but differs for byte (code point) values from 128 through 255. This leads to the following discrepancy:

>>> int('123\xa0')
123

but

>>> int(b'123\xa0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 3: invalid start byte
>>> int('123\xa0'.encode())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '123\xa0'

@serhiy-storchaka
Member

For the last discrepancy, see bpo-16741. It has a patch which should fix this.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022