classification
Title: Review and document string format accepted in numeric data type constructors
Type: enhancement Stage:
Components: Documentation, Interpreter Core Versions: Python 3.3, Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: belopolsky Nosy List: belopolsky, cvrebert, eric.araujo, eric.smith, ezio.melotti, lemburg, loewis, mark.dickinson, ncoghlan, serhiy.storchaka, vstinner
Priority: normal Keywords:

Created on 2010-11-29 17:55 by belopolsky, last changed 2014-10-14 15:12 by skrah.

Messages (16)
msg122834 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-29 17:55
I am opening a new report to continue work on the issues raised in #10557 that are either feature requests or documentation bugs.

The rest is my reply to the relevant portions of Marc's comment at msg122785.

On Mon, Nov 29, 2010 at 4:41 AM, Marc-Andre Lemburg <report@bugs.python.org> wrote:
..
> Alexander Belopolsky wrote:
>>
>> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
>>
>> After a bit of svn archeology, it does appear that Arabic-Indic
>> digits' support was deliberate at least in the sense that the
>> feature was tested for when the code was first committed. See r15000.
>
> As I mentioned on python-dev (http://mail.python.org/pipermail/python-dev/2010-November/106077.html)
> this support was added intentionally.
>
>> The test migrated from file to file over the last 10 years, but it
>> is still present in test_float.py:
>>
>>         self.assertEqual(float(b"  \u0663.\u0661\u0664  ".decode('raw-unicode-escape')), 3.14)
>>
>> (It should probably be now rewritten using a string literal.)
>>
..
>> For the future, I note that starting with Unicode 6.0.0,
>> the Unicode Consortium promises that
>>
>> """
>> Characters with the property value Numeric_Type=de (Decimal) only
>> occur in contiguous ranges of 10 characters, with ascending numeric
>> values from 0 to 9 (Numeric_Value=0..9).
>> """
>>
>> This makes it very easy to check a numeric string does not contain
>> a mix of digits from different scripts.
>
> I'm not sure why you'd want to check for such ranges.
>

In order to disallow a mix of say Arabic-Indic and Bengali digits.  Such combinations cannot be defended as possibly valid numbers in any script.

>> I still believe that proper API should require explicit choice of
>> language or locale before allowing digits other than 0-9 just as
>> int() would not accept hexadecimal digits without explicit choice of
>> base >= 16.  But this would be a subject of a feature request.
>
> Since when do we require a locale or language to be specified when
> using Unicode ?
>

This is a valid question.  I may be in the minority, but I find it convenient to use int(), float() etc. for data validation.  If my program gets a CSV file with Arabic-Indic digits, I want to fire the guy who prepared it before it causes real issues. :-)  I may be too strict, but I don't think anyone would want to see columns with a mix of Bengali and Devanagari numerals.

On the other hand there is certain convenience in promiscuous parsers, but this is not an expectation that I have from int() and friends.  int('0xFF') requires me to specify base even though 0xFF is a perfectly valid notation.

There are pros and cons in any approach.  Let's figure out what is better before we fix the documentation.

> The codecs, Unicode methods and other Unicode support features
> happily work with all kinds of languages, mixed or not, without any
> such specification.

In my view int() and friends are only marginally related to Unicode and Unicode methods design is not directly relevant to their behavior.  If we were designing str.todigits(), by all means, I would argue that it must be consistent with str.isdigit().  For numeric data, however, I think we should follow the logic that rejected int('0xFF').

This is my opinion.  We can consider allowing int('0xFF') as well.  Let's discuss.
msg122835 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-29 17:57
See also issue 9574 for a somewhat related discussion.
msg135469 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-05-07 15:25
> I may be in the minority, but I find it convenient to use int(), float()
> etc. for data validation.
A number of libraries agree: argparse, HTML form handling libs, etc.

> I may be too strict, but I don't think anyone would want to see
> columns with a mix of Bengali and Devanagari numerals. [...]
> On the other hand there is certain convenience in promiscuous
> parsers, but this is not an expectation that I have from int() and
> friends. [...] There are pros and cons in any approach.
Indeed, tough question.  On one hand, I tend to agree that mixing Hindi/Arab numerals with Bengali does not make sense; on the other hand, rejecting it means that the int code does know about Unicode, which you argued against.

>[MAL]
>> The codecs, Unicode methods and other Unicode support features
>> happily work with all kinds of languages, mixed or not, without any
>> such specification.
> In my view int() and friends are only marginally related to Unicode
> and Unicode methods design is not directly relevant to their behavior.
I think I agree.  It’s perfectly fine that Unicode support features don’t care about the type of the characters but just encode and decode; however, int has a validation step.  It rejects numerals that don’t make sense with the given base for example, so rejecting nonsensical sequences of Unicode numerals makes sense IMO.

What do other languages that are able to convert Unicode numerals to integer objects do?
msg135536 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2011-05-08 17:21
On Sat, May 7, 2011 at 11:25 AM, Éric Araujo <report@bugs.python.org> wrote:
> .. On one hand, I tend to agree that mixing Hindi/Arab numerals with Bengali does not make sense;
> on the other hand, rejecting it means that the int code does know about Unicode, which you argued
> against.

In order to flag use of mixed scripts in numerals, the code does not
require access to any additional unicode data.  Since Unicode 6.0.0,
programmers can rely on the following stability promise:

"""
Characters with the property value Numeric_Type=de (Decimal) only
occur in contiguous ranges of 10 characters, with ascending numeric
values from 0 to 9 (Numeric_Value=0..9).
"""  -- http://www.unicode.org/policies/stability_policy.html

Therefore, the validation code can simply check that for all digits in
the number, ord(d) - unicodedata.numeric(d) is the same.
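
A minimal sketch of that check, for illustration only. The helper names digit_zero_points and has_mixed_digit_scripts are hypothetical, and the sketch uses unicodedata.decimal() rather than unicodedata.numeric() so that only characters with a decimal digit value are considered:

```python
import unicodedata


def digit_zero_points(text):
    """Return the set of implied 'zero' code points for the decimal digits in text.

    For a digit d, ord(d) - decimal(d) is the code point of the zero digit of
    its range, so digits from one range all yield the same value.
    """
    zeros = set()
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-decimal characters
        if value is not None:
            zeros.add(ord(ch) - value)
    return zeros


def has_mixed_digit_scripts(text):
    """True if text contains decimal digits from more than one range."""
    return len(digit_zero_points(text)) > 1
```

Under this sketch, ASCII digits count as one more range, so a string mixing '1' with an Arabic-Indic digit is flagged as well.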
msg135977 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2011-05-14 15:52
> I find it convenient to use int(), float() etc. for data validation.

Me too.  This is why I'd still be happiest with int and float not accepting non-ASCII digits at all.  (And also why the recent suggestions to allow extra underscores in int and float input make me uneasy.)
msg190949 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-06-11 07:05
I've changed my mind :-)

Restricting the decimal encoder to only accept code points from one of the possible decimal digit ranges is a good idea. Let's do that.
msg191011 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-06-12 05:32
It looks like we are approaching consensus on some points:

1. Mixed script numerals should be disallowed.
2. '\N{MINUS SIGN}' should be accepted as an alternative to '\N{HYPHEN-MINUS}'

Open question: should we accept fullwidth + and -, sub/superscript variants etc.?  I believe rather than debating variant codepoints one by one, we should consider applying NFKC (compatibility) normalization to unicode strings to be interpreted as numbers.  This would allow parsing strings like this:

>>> from unicodedata import normalize
>>> float(normalize('NFKC', '\N{FULLWIDTH HYPHEN-MINUS}\N{DIGIT ONE FULL STOP}\N{FULLWIDTH DIGIT TWO}'))
-1.2
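
NFKC alone does not cover point 2 above, because '\N{MINUS SIGN}' has no compatibility decomposition and is left unchanged by normalization. A rough sketch of combining both ideas follows; the function name to_float_lenient is hypothetical, and this illustrates the proposal rather than any behavior float() actually has:

```python
import unicodedata


def to_float_lenient(text):
    """Parse a float after NFKC normalization, also accepting U+2212 MINUS SIGN."""
    normalized = unicodedata.normalize('NFKC', text)
    # NFKC leaves MINUS SIGN untouched, so map it to HYPHEN-MINUS explicitly.
    normalized = normalized.replace('\N{MINUS SIGN}', '-')
    return float(normalized)
```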
msg191014 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-06-12 07:05
On 12.06.2013 07:32, Alexander Belopolsky wrote:
> 
> Alexander Belopolsky added the comment:
> 
> It looks like we are approaching consensus on some points:
> 
> 1. Mixed script numerals should be disallowed.
> 2. '\N{MINUS SIGN}' should be accepted as an alternative to '\N{HYPHEN-MINUS}'
> 
> Open question: should we accept fullwidth + and -, sub/superscript variants etc.?  I believe rather than debating variant codepoints one by one, we should consider applying NFKC (compatibility) normalization to unicode strings to be interpreted as numbers.  This would allow parsing strings like this:
> 
>>>> float(normalize('NFKC', '\N{FULLWIDTH HYPHEN-MINUS}\N{DIGIT ONE FULL STOP}\N{FULLWIDTH DIGIT TWO}'))
> -1.2

While it would solve these cases, I think that would cause a
significant performance hit.

Perhaps we could do this in two phases:
1. detect whether the string uses non-ASCII digits and symbols
2. if it does, apply normalization and then use the decimal codec
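
The two phases could be sketched in pure Python roughly as follows. This is only an illustration: parse_int_two_phase is a hypothetical name, str.isascii() requires Python 3.7 or later, and a real implementation would live in the C decimal encoder:

```python
import unicodedata


def parse_int_two_phase(text):
    """Parse an int, normalizing only when non-ASCII characters are present."""
    # Phase 1: pure-ASCII input skips normalization entirely.
    if text.isascii():
        return int(text)
    # Phase 2: normalize compatibility variants, then parse as usual.
    return int(unicodedata.normalize('NFKC', text))
```

With this sketch, superscript digits such as '\u00b9\u00b2' parse as 12 once normalized, while ASCII input pays no extra cost beyond the isascii() check.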
msg191081 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-06-13 12:06
I think PEP 393 gives us a cheap fast path for parsing: if the max char is < 128, just roll straight into normal processing, otherwise do the normalisation and "all decimal digits are from the same script" steps.

There are almost certainly better ways to do the script translation, but the example below tries to just do the "convert to ASCII" step to avoid duplicating the +/- and decimal point processing logic:

    if max_char(arg) >= 128:
        arg = toNFKC(arg)
        originals = set()
        converted = []
        for c in arg:
            try:
                d = str(unicodedata.decimal(c))
            except ValueError:
                d = c
            else:
                originals.add(ord(c))
            converted.append(d)
        if originals and (max(originals) - min(originals)) >= 10:
            raise ValueError("%s mixes digits from multiple scripts" % arg)
        arg = "".join(converted)
    result = parse_ascii_number(arg)


P.S. I don't think the base argument is especially applicable ('0x' is rejected because 'x' is not a base 10 digit and we allow a base of '0' to request "use int literal base markers").
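
For reference, the base behavior described here can be seen with a small wrapper (parse_or_none is a hypothetical helper, used only to keep the example free of tracebacks):

```python
def parse_or_none(text, base=10):
    """Return int(text, base), or None if the literal is rejected."""
    try:
        return int(text, base)
    except ValueError:
        return None


print(parse_or_none('0xFF'))      # None: 'x' is not a base 10 digit
print(parse_or_none('0xFF', 16))  # 255: the prefix is allowed for base 16
print(parse_or_none('0xFF', 0))   # 255: base 0 follows int literal markers
print(parse_or_none('FF', 16))    # 255
```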
msg191101 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-06-14 01:43
PEP 393 implementation has already added the fast path to decimal encoding:

http://hg.python.org/cpython/diff/8beaa9a37387/Objects/unicodeobject.c#l1.3735

What we can do, however, is improve the performance of converting non-ASCII numerals by looking up only the first digit's value and converting the rest using a simple offset computation:

value = code - (first_code - first_value)
if not 0 <= value < 10:
   raise or fall back to UCD lookup
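
Spelled out in Python, the idea might look like the sketch below. The name digits_to_int is hypothetical, the input is assumed to be a bare run of digits with no sign or whitespace, and the sketch raises instead of falling back to a UCD lookup:

```python
import unicodedata


def digits_to_int(digits):
    """Convert a run of decimal digits from a single range to an int."""
    if not digits:
        raise ValueError("empty digit string")
    first = digits[0]
    # Only the first character needs a UCD lookup; this raises ValueError
    # if it is not a decimal digit.
    zero = ord(first) - unicodedata.decimal(first)
    result = 0
    for ch in digits:
        value = ord(ch) - zero
        if not 0 <= value < 10:
            raise ValueError("%r is not a digit from the same range" % ch)
        result = result * 10 + value
    return result
```

This relies on the Unicode stability promise quoted earlier in the thread: because decimal digits occur only in contiguous ranges of ten, an offset between 0 and 9 from the range's zero identifies a digit of the same range.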
msg191105 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-06-14 07:53
On 14.06.2013 03:43, Alexander Belopolsky wrote:
> 
> Alexander Belopolsky added the comment:
> 
> PEP 393 implementation has already added the fast path to decimal encoding:
> 
> http://hg.python.org/cpython/diff/8beaa9a37387/Objects/unicodeobject.c#l1.3735
> 
> What we can do, however, is improve performance of converting non-ascii numerals by looking up only the first digit's value and converting the rest using simple:
> 
> value = code - (first_code - first_value)
> if not 0 <= value < 10:
>    raise or fall back to UCD lookup

I'm not sure whether just relying on PEP 393 is good enough.

Of course, you can special case the conversion based on the
kind, but that's only one form of optimization.

Slicing operations don't recheck the max code point
used in the substring. As a result, a slice may very well
be of the UCS2 kind, even though the text itself is ASCII.

Apart from the fast-path based on the string kind,
I think the decimal encoder would also have to scan the
string for non-ASCII code points. If it finds non-ASCII
code points, it would have to call the normalizer and
restart the scan based on the normalized string.

-- 
Marc-Andre Lemburg
eGenix.com

msg191302 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-06-17 00:56
I took another look at the library reference and it looks like when it comes to non-ascii digits support, the reference contradicts itself.  On one hand,

"""
int(x, base=10)

If x is not a number or if base is given, then x must be a string, bytes, or bytearray instance representing an integer literal in radix base. Optionally, the literal can be preceded by + or - (with no space in between) and surrounded by whitespace.
""" <http://docs.python.org/3/library/functions.html#int>

.. suggests that only "an integer literal" will be accepted by int(), but on the other hand, a note in the "Numeric Types" section says: "The numeric literals accepted include the digits 0 to 9 or any Unicode equivalent (code points with the Nd property)." <http://docs.python.org/3/library/stdtypes.html#typesnumeric>

It also appears that "surrounded by whitespace" part is not entirely correct:

>>> '\N{RS}'.isspace()
True
>>> int('123\N{RS}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '123\x1e'

This is probably a bug in the current implementation and I will open a separate issue for that.
msg191304 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-06-17 01:07
I opened issue18236 to address the issue of surrounding whitespace.
msg191314 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-06-17 04:46
I have started a rough prototype for what I plan to eventually reimplement in C and propose as a patch here.

https://bitbucket.org/alexander_belopolsky/misc/src/c175171cc76e/utoi.py?at=master

Comments welcome.
msg191709 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-06-23 16:59
Martin v. Löwis wrote at #18236 (msg191687):
> int conversion ultimately uses Py_ISSPACE, which conceptually could
> deviate from the Unicode properties (as it is byte-based). This is not
> really an issue, since they indeed match.

Py_ISSPACE matches the Unicode White_Space property in the ASCII range (the first 128 code points), but it differs for byte (code point) values from 128 through 255.  This leads to the following discrepancy:

>>> int('123\xa0')
123

but

>>> int(b'123\xa0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 3: invalid start byte
>>> int('123\xa0'.encode())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '123\xa0'
msg191720 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-06-23 19:06
For the last discrepancy see issue16741. It has a patch which should fix this.
History
Date                 User              Action  Args
2014-10-14 15:12:17  skrah             set     nosy: - skrah
2013-06-23 19:06:17  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg191720
2013-06-23 17:00:17  belopolsky        set     nosy: + loewis
2013-06-23 16:59:58  belopolsky        set     messages: + msg191709
2013-06-17 04:46:59  belopolsky        set     messages: + msg191314
2013-06-17 01:07:29  belopolsky        set     messages: + msg191304
2013-06-17 00:56:47  belopolsky        set     messages: + msg191302; versions: + Python 3.4
2013-06-14 07:53:40  lemburg           set     messages: + msg191105
2013-06-14 01:43:38  belopolsky        set     messages: + msg191101
2013-06-13 12:06:58  ncoghlan          set     nosy: + ncoghlan; messages: + msg191081
2013-06-12 07:05:08  lemburg           set     messages: + msg191014
2013-06-12 05:32:17  belopolsky        set     messages: + msg191011
2013-06-12 05:01:33  belopolsky        link    issue6632 superseder
2013-06-11 07:05:50  lemburg           set     messages: + msg190949
2013-06-11 05:36:27  cvrebert          set     nosy: + cvrebert
2011-05-14 15:52:12  mark.dickinson    set     messages: + msg135977
2011-05-08 17:21:04  belopolsky        set     messages: + msg135536
2011-05-07 15:25:55  eric.araujo       set     nosy: + eric.araujo; messages: + msg135469
2010-11-29 17:57:47  belopolsky        set     messages: + msg122835
2010-11-29 17:55:50  belopolsky        create