classification
Title: Malformed error message from float()
Type: behavior Stage: commit review
Components: Interpreter Core Versions: Python 3.2, Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: belopolsky Nosy List: belopolsky, eric.smith, ezio.melotti, lemburg, mark.dickinson, serhiy.storchaka, skrah
Priority: normal Keywords: patch

Created on 2010-11-28 03:22 by belopolsky, last changed 2013-02-10 18:12 by mark.dickinson. This issue is now closed.

Files
File name Uploaded Description Edit
float-error.diff belopolsky, 2010-11-28 03:22 review
issue10557.diff belopolsky, 2010-11-29 06:23 review
issue10557a.diff belopolsky, 2010-11-29 15:39 review
issue10557b.diff belopolsky, 2010-12-02 17:38 review
issue10557c.diff belopolsky, 2010-12-03 02:53 review
issue10557d.diff belopolsky, 2010-12-04 03:04 review
Messages (45)
msg122612 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-28 03:22
>>> float('½')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: �

>>> float('42½')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError

With the attached patch, float-error.diff


>>> float('½')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for float(): ½
>>> float('42½')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for float(): 42½

Note that the proposed patch also has an effect of disallowing non-ascii digits in float() constructor.

Before the patch:

>>> float('١٢٣٤.٥٦')
1234.56

After the patch:

>>> float('١٢٣٤.٥٦')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: ١٢٣٤.٥٦

I am not sure, whether support for non-ascii digits in float() constructor is worth maintaining.  (Anyone knows whether Arabic numbers are written right to left or left to right?  What is the proper decimal point character?)

Also, I don't think users expect UnicodeEncodeError from float() or int().

Before the patch:

>>> float('\uffff')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character '\uffff' in position 0: invalid decimal Unicode string


After the patch:

>>> float('\uffff')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: ￿
msg122625 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-11-28 04:29
I think float() should support non-ascii digits but I agree that it would be better to avoid UnicodeErrors and convert them to ValueErrors so that

>>> float('١٢٣٤.٥٦')
1234.56

and

>>> float('½')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for float(): ½

I.e. float should do the C equivalent of:
try:
  s = arg.encode('decimal')
except UnicodeEncodeError:
  raise ValueError('Invalid liter for float(): {}'.format(arg))

Note that int() and Decimal() supports non-ascii chars too.
msg122634 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-11-28 05:37
FWIW the UnicodeError comes from PyUnicode_EncodeDecimal (unicodeobject.c:6212) and the "ValueError: could not convert string to float" with the buggy � comes from PyOS_string_to_double (pystrtod.c:316).  Maybe PyOS_string_to_double should be fixed to display the string correctly, but I don't know what will be affected from that (and if it will make things worse or better).  The UnicodeError can be fixed in PyFloat_FromString (floatobject.c:174).
msg122679 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-11-28 16:36
> I am not sure, whether support for non-ascii digits in float()
> constructor is worth maintaining.

I'd be very happy to drop such support.  If you allow alternative digit sets, trying to work out exactly what should be supported and what shouldn't gets very messy;  it's even worse for int(), where bases > 10 have to be taken into account.

> (Anyone knows whether Arabic numbers are written right to left
> or left to right?  What is the proper decimal point character?)

Well, judging by the chocolate packaging I saw recently, they're written left to right (so presumably if you're reading right-to-left, you see the units first, then the tens, etc., which always struck me as the 'right' way to write things in the first place :-).  No idea about the proper decimal point character, though.
msg122683 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-11-28 17:00
About Alexander's solution:  might it make more sense to have PyUnicode_EncodeDecimal raise for inputs like this?  I see it as PyUnicode_EncodeDecimal's job to turn the unicode input into usable ASCII (or raise an exception);  it looks like that's not happening here.

Adding MAL to the nosy in case he wants to comment on this.
msg122684 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-28 17:24
On Sun, Nov 28, 2010 at 11:37 AM, Mark Dickinson <report@bugs.python.org> wrote:
>
> Mark Dickinson <dickinsm@gmail.com> added the comment:
>
>> I am not sure, whether support for non-ascii digits in float()
>> constructor is worth maintaining.
>
> I'd be very happy to drop such support.  If you allow alternative digit sets, trying to work out exactly what should
> be supported and what shouldn't gets very messy;  it's even worse for int(), where bases > 10 have to be taken
> into account.
>

I think there is a proposal (if not already in Unicode 6.0) to add
hexadecimal value property to UCD, but I don't think it is right for
int() and float() to depend on UCD in the first place.  If people need
to process exotic decimals, we can expose 'decimal' encoding, however,
translating from code-point to decimal value is trivial - just
subtract the code-point value for zero.

>> (Anyone knows whether Arabic numbers are written right to left
>> or left to right?  What is the proper decimal point character?)
>
> Well, judging by the chocolate packaging I saw recently, they're written left to right
> (so presumably if you're reading right-to-left, you see the units first, then the tens, etc.,
> which always struck me as the 'right' way to write things in the first place :-).

Did your chocolate packaging use European digits or Arabic-Indic ones?
 Note that they have different bidi properties:

'EN'
>>> unicodedata.bidirectional('\N{ARABIC-INDIC DIGIT ONE}')
'AN'

(I am not sure what 'AN' (Arabic Numeral?) bidi property actually mean.)
msg122687 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-11-28 17:33
> Did your chocolate packaging use European digits or Arabic-Indic ones?
> Note that they have different bidi properties:

Good question;  I think it was Arabic-Indic digits, but to be honest I don't remember.  (It wasn't *all* that recently.)
msg122688 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-28 17:34
On Sun, Nov 28, 2010 at 12:00 PM, Mark Dickinson <report@bugs.python.org> wrote:
>
> Mark Dickinson <dickinsm@gmail.com> added the comment:
>
> About Alexander's solution:  might it make more sense to have PyUnicode_EncodeDecimal raise
> for inputs like this?

No, I think PyOS_string_to_double() can generate better error messages
than  PyUnicode_EncodeDecimal.  It is important to pass losslessly
encoded string to PyOS_string_to_double() for proper error reporting.
 Otherwise, we will have to catch the error in PyFloat_FromString()
just to add the string value to the message and may loose other
information such as the precise location of the offending character.
(AFAICT, we don't make use of it now, but this would be a meaningful
improvement.)

> I see it as PyUnicode_EncodeDecimal's job to turn the unicode input into usable ASCII
> (or raise an exception);  it looks like that's not happening here.

UTF-8 is quite usable by PyOS_string_to_double() .  UTF-8 encoder is
extremely fast and will only get faster over time.  In my opinion,
PyUnicode_EncodeDecimal() is either unnecessary or should be exposed
as a codec.
msg122689 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010-11-28 17:48
> PyUnicode_EncodeDecimal() is either unnecessary or should be exposed
> as a codec.

I'm depending on PyUnicode_EncodeDecimal in cdecimal. In fact, it saved
me quite a bit of trouble. I wouldn't be surprised if other extension
writers use it as well.
msg122691 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-11-28 18:01
Mark Dickinson wrote:
> 
> Mark Dickinson <dickinsm@gmail.com> added the comment:
> 
> About Alexander's solution:  might it make more sense to have PyUnicode_EncodeDecimal raise for inputs like this?  I see it as PyUnicode_EncodeDecimal's job to turn the unicode input into usable ASCII (or raise an exception);  it looks like that's not happening here.
> 
> Adding MAL to the nosy in case he wants to comment on this.

The purpose of the PyUnicode_EncodeDecimal() API is to convert
Unicode decimal digits to ASCII digits for interpretation by
the other usual functions to convert the ASCII representation to
numbers.

The proposed patch will not work, since it removes the support for
non-ASCII number code points, e.g. Asian number code points.

You might want to replace the error message by something more
related to floats in the float constructor. Note that UnicodeErrors
are subclasses of ValueErrors, so the errors are not unexpected
for number constructors.
msg122693 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-11-28 18:09
>>>> float('½')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ValueError: could not convert string to float: �
> 
>>>> float('42½')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ValueError

Note that fractional Unicode code points are not supported
by the encoding function.  Neither are code points which do
not evaluate to 0-9, e.g. ones that represent numbers
larger than 9.
msg122698 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-28 18:37
Issue #10567 demonstrated the problem of relying on the Unicode
database in Python builtins.  Apparently, Unicode does not guarantee
stability of the character categories.   On the other hand, we are
already tied to UCD for the language definition.  Maybe Python should
document the version of Unicode it is using in any given version and
possibly upgrade to Unicode 6.0 should be postponed until the language
moratorium is lifted.
msg122716 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010-11-28 20:09
> (Anyone knows whether Arabic numbers are written right to left or left 
> to right?  What is the proper decimal point character?)

Thousands separator: U+066C
Decimal point: U+066B

١٢٣٬١٢٣٫١٢ should be: 123,123.12


Wikipedia says that digits are arranged in the usual way, lowest
value digit to the right.
msg122719 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-28 20:22
On Sun, Nov 28, 2010 at 3:09 PM, Stefan Krah <report@bugs.python.org> wrote:
..
> Decimal point: U+066B

Well, not so fast:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character '\u066b' in
position 1: invalid decimal Unicode string
msg122723 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010-11-28 20:30
> UnicodeEncodeError: 'decimal' codec can't encode character '\u066b'

Hmm, looks like a bug? I think U+066B is correct.
msg122726 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-28 20:38
Sending this by e-mail was not a good idea ...

On Sun, Nov 28, 2010 at 3:30 PM, Stefan Krah <report@bugs.python.org> wrote:
..
>> UnicodeEncodeError: 'decimal' codec can't encode character '\u066b'
>
> Hmm, looks like a bug? I think U+066B is correct.
>
Really?  What about

>>> float('1234.56')
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character '\uff0e' in
position 4: invalid decimal Unicode string

.. and where do we draw the line?  Note that I am not against
Decimal() accepting any c with c.isdigit() returning True, but
builtins should be less promiscuous IMO.
msg122775 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-29 06:23
After a bit of svn archeology, it does appear that Arabic-Indic digits' support was deliberate at least in the sense that the feature was tested for when the code was first committed. See r15000.

The test migrated from file to file over the last 10 years, but it is still present in test_float.py:

        self.assertEqual(float(b"  \u0663.\u0661\u0664  ".decode('raw-unicode-escape')), 3.14)

(It should probably be now rewritten using a string literal.)

I am now attaching the patch (issue10557.diff) that fixes the bug without sacrificing non-ASCII digit support.

If this approach is well-received, I would like to replace all calls to PyUnicode_EncodeDecimal() with the calls to the new _PyUnicode_EncodeDecimalUTF8() and deprecate Latin-1-oriented PyUnicode_EncodeDecimal().

For the future, I note that starting with Unicode 6.0.0, the Unicode Consortium promises that

"""
Characters with the property value Numeric_Type=de (Decimal) only occur in contiguous ranges of 10 characters, with ascending numeric values from 0 to 9 (Numeric_Value=0..9).
"""

This makes it very easy to check a numeric string does not contain a mix of digits from different scripts.

I still believe that proper API should require explicit choice of language or locale before allowing digits other than 0-9 just as int() would not accept hexadecimal digits without explicit choice of base >= 16.  But this would be a subject of a feature request.
msg122785 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-11-29 09:41
Alexander Belopolsky wrote:
> 
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
> 
> After a bit of svn archeology, it does appear that Arabic-Indic digits' support was deliberate at least in the sense that the feature was tested for when the code was first committed. See r15000.

As I mentioned on python-dev (http://mail.python.org/pipermail/python-dev/2010-November/106077.html)
this support was added intentionally.

> The test migrated from file to file over the last 10 years, but it is still present in test_float.py:
> 
>         self.assertEqual(float(b"  \u0663.\u0661\u0664  ".decode('raw-unicode-escape')), 3.14)
> 
> (It should probably be now rewritten using a string literal.)
> 
> I am now attaching the patch (issue10557.diff) that fixes the bug without sacrificing non-ASCII digit support.
> If this approach is well-received, I would like to replace all calls to PyUnicode_EncodeDecimal() with the calls to the new _PyUnicode_EncodeDecimalUTF8() and deprecate Latin-1-oriented PyUnicode_EncodeDecimal().

It would be better to copy and iterate over the Unicode string first,
replacing any decimal code points with ASCII ones and then call the
UTF-8 encoder.

The code as it stands is very inefficient, since it will most likely
run the memcpy() part for every code point after the first non-ASCII
decimal one.

> For the future, I note that starting with Unicode 6.0.0, the Unicode Consortium promises that
> 
> """
> Characters with the property value Numeric_Type=de (Decimal) only occur in contiguous ranges of 10 characters, with ascending numeric values from 0 to 9 (Numeric_Value=0..9).
> """
> 
> This makes it very easy to check a numeric string does not contain a mix of digits from different scripts.

I'm not sure why you'd want to check for such ranges.

> I still believe that proper API should require explicit choice of language or locale before allowing digits other than 0-9 just as int() would not accept hexadecimal digits without explicit choice of base >= 16.  But this would be a subject of a feature request.

Since when do we require a locale or language to be specified when
using Unicode ?

The codecs, Unicode methods and other Unicode support features
happily work with all kinds of languages, mixed or not, without any
such specification.
msg122816 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-29 15:39
On Mon, Nov 29, 2010 at 4:41 AM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
..
> It would be better to copy and iterate over the Unicode string first,
> replacing any decimal code points with ASCII ones and then call the
> UTF-8 encoder.
>

Good idea.

> The code as it stands is very inefficient, since it will most likely
> run the memcpy() part for every code point after the first non-ASCII
> decimal one.
>

I doubt there are measurable gains from this optimization, but doing
conversion in Unicode characters results in cleaner API.  The new
patch, issue10557a.diff, implements
_PyUnicode_NormalizeDecimal(Py_UNICODE *s, Py_ssize_t length) which is
defined as follows:

/* Strip leading and trailing space and convert code points that have
decimal
   digit property to the corresponding ASCII digit code point.

   Returns a new Unicode string on success, NULL on failure.
*/

Note that I used deprecated _PyUnicode_AsStringAndSize() in
floatobject.c not only because it is convenient, but also because I
believe that in the future numerical value parsers should be converted
to operate on unicode characters.  When this happens, the use of
_PyUnicode_AsStringAndSize() can be removed.
msg123087 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-02 17:38
I am submitting a patch (issue10557b.diff) for commit review.  As Marc suggested, decimal conversion is now performed on Py_UNICODE characters. For this purpose, I introduced _PyUnicode_NormalizeDecimal() function that takes Py_UNICODE and returns a PyUnicode object with whitespace stripped and non-ASCII digits converted to ASCII equivalents.  The PyUnicode_EncodeDecimal() function is no longer used and I added a comment recommending that _PyUnicode_NormalizeDecimal() be used instead. I would like to eventually remove PyUnicode_EncodeDecimal(), but I amd not sure about the proper deprecation procedures for undocumented C APIs.

As a result, int(), float(), etc will no longer raise UnicodeDecodeError unless given a string with lone surrogates.  (This error comes from UTF-8 codec that is applied after digit normalization.)

A few error cases such as embedded '\0' and non-digit characters with ord(c) > 255 will now raise ValueError instead of UnicodeDecodeError.  Since UnicodeDecodeError is a subclass of ValueError, it is unlikely that existing code would attempt to differentiate between the two.  It is possible to achieve complete compatibility, but it is hard to justify reporting different error types on non-digit characters below and above code point 255.

The patch contains tests for error messages that I tried to make robust by only requiring that s.strip() be found somewhere in the error message from int(s).  Note that since in this patch whitespace is stripped before the string is passed to the parser, the parser errors do not contain the whitespace.  This may actually be desirable because it helps the user to see the source of the error without being distracted by irrelevant white space.
msg123088 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-12-02 17:54
Is the stripping of whitespace necessary for this fix?

Currently, the complex constructor accepts whitespace both inside and outside the (optional) parentheses:

>>> complex(' ( 2+3j ) ')
(2+3j)

The classes of whitespace accepted in each position are the same.  IIUC, with your patch, that consistency would be lost---is that right?

If the whitespace stripping isn't necessary then I'd prefer to leave that change for another issue.
msg123091 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-12-02 18:08
Just to clarify:  I'm not opposed to allowing arbitrary Unicode whitespace in the float, int, complex constructors (indeed, it's probably a good thing).  But I'd like to see the change made consistently;  for the complex constructor this looks a bit involved, so it would probably be cleaner to have a separate patch to make the whitespace changes.
msg123094 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-02 18:28
On Thu, Dec 2, 2010 at 12:54 PM, Mark Dickinson <report@bugs.python.org> wrote:
..
> The classes of whitespace accepted in each position are the same.  IIUC, with your patch,
> that consistency would be lost---is that right?

Good point. I thought The PyUnicode_EncodeDecimal() was stripping the
space, but it was converting it to ASCII ' ' instead.  That's easy to
fix.   Can you suggest a test case?
msg123095 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-02 18:28
On Thu, Dec 2, 2010 at 1:28 PM, Alexander Belopolsky
<belopolsky@users.sourceforge.net> wrote:
..
>  Can you suggest a test case?

I mean for complex().
msg123103 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-12-02 19:20
Ah yes, you're right: this shouldn't be a hard fix. I withdraw my suggestion for a separate patch.  :-)

Checking that:

  complex('\xa0(\xa02+3j\xa0)\xa0') == complex(2.0, 3.0)

would probably be enough.
msg123123 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-12-02 21:34
Alexander Belopolsky wrote:
> 
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
> 
> I am submitting a patch (issue10557b.diff) for commit review.  As Marc suggested, decimal conversion is now performed on Py_UNICODE characters. For this purpose, I introduced _PyUnicode_NormalizeDecimal() function that takes Py_UNICODE and returns a PyUnicode object with whitespace stripped and non-ASCII digits converted to ASCII equivalents.  The PyUnicode_EncodeDecimal() function is no longer used and I added a comment recommending that _PyUnicode_NormalizeDecimal() be used instead. I would like to eventually remove PyUnicode_EncodeDecimal(), but I amd not sure about the proper deprecation procedures for undocumented C APIs.
> 
> As a result, int(), float(), etc will no longer raise UnicodeDecodeError unless given a string with lone surrogates.  (This error comes from UTF-8 codec that is applied after digit normalization.)
> 
> A few error cases such as embedded '\0' and non-digit characters with ord(c) > 255 will now raise ValueError instead of UnicodeDecodeError.  Since UnicodeDecodeError is a subclass of ValueError, it is unlikely that existing code would attempt to differentiate between the two.  It is possible to achieve complete compatibility, but it is hard to justify reporting different error types on non-digit characters below and above code point 255.
> 
> The patch contains tests for error messages that I tried to make robust by only requiring that s.strip() be found somewhere in the error message from int(s).  Note that since in this patch whitespace is stripped before the string is passed to the parser, the parser errors do not contain the whitespace.  This may actually be desirable because it helps the user to see the source of the error without being distracted by irrelevant white space.

Thanks for the patch. I've had a quick look...

Some comments:

 * Please change the API _PyUnicode_NormalizeDecimal() to
   PyUnicode_ConvertToASCIIDecimal() - that's closer to what
   it does.

 * Don't have the API remove any whitespace. It should just
   work on decimal digit code points (chainging the length
   of the Unicode string is a bad idea).

 * Please remove the note "This function is no longer used.
   Use _PyUnicode_NormalizeDecimal instead." from the
   PyUnicode_EncodeDecimal() API description in the
   header file. The API won't go away (it does have its
   use and is being used in 3rd party extensions) and
   you cannot guide people to use a Python private API.

 * Please double check the ref counts. I think you have a leak
   in PyLong_FromUnicode() (for norm) and possible in other
   functions as well.
msg123133 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-02 22:08
On Thu, Dec 2, 2010 at 4:34 PM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
..
>  * Please change the API _PyUnicode_NormalizeDecimal() to
>   PyUnicode_ConvertToASCIIDecimal() - that's closer to what
>   it does.
>

Are you sure it is a good idea to give it a public name?  I have no
problem with calling it _PyUnicode_ConvertToASCIIDecimal().
("Transform" may be a better term, though.)

>  * Don't have the API remove any whitespace. It should just
>   work on decimal digit code points (chainging the length
>   of the Unicode string is a bad idea).
>

Yes, that was a bad idea, but the old EncodeDecimal was replacing all
Unicode space with ASCII ' '.  It will be hard to replicate old
behavior without doing the same in  ConvertToASCIIDecimal().

>  * Please remove the note "This function is no longer used.
>   Use _PyUnicode_NormalizeDecimal instead." from the
>   PyUnicode_EncodeDecimal() API description in the
>   header file. The API won't go away (it does have its
>   use and is being used in 3rd party extensions) and
>   you cannot guide people to use a Python private API.
>

OK.  I had the same reservations about recommending private API.

>  * Please double check the ref counts. I think you have a leak
>   in PyLong_FromUnicode() (for norm) and possible in other
>   functions as well.

Will do.  I should also add some more tests for error conditions.  I
test for leaks, but if the error branch is not covered, it is not
covered.
msg123144 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010-12-02 23:32
Alexander Belopolsky <report@bugs.python.org> wrote:
> On Thu, Dec 2, 2010 at 4:34 PM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
> ..
> >  * Please change the API _PyUnicode_NormalizeDecimal() to
> >   PyUnicode_ConvertToASCIIDecimal() - that's closer to what
> >   it does.
> >
> 
> Are you sure it is a good idea to give it a public name?  I have no
> problem with calling it _PyUnicode_ConvertToASCIIDecimal().
> ("Transform" may be a better term, though.)

I like the public name. Extension authors can use it and be sure that
their programs accept exactly the same numeric strings as the rest of
Python.

Are you worried that the semantics might change? If they do, I would
actually welcome to have an official transformation function that
automatically follows the current preferences of python-dev (or the
Unicode Consortium).
msg123147 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-03 00:07
On Thu, Dec 2, 2010 at 6:32 PM, Stefan Krah <report@bugs.python.org> wrote:
..
> I like the public name. Extension authors can use it and be sure that
> their programs accept exactly the same numeric strings as the rest of
> Python.
>
> Are you worried that the semantics might change?

Yes, I am already working on a change that instead of stripping
whitespace will replace it with ASCI space.  I don't think that a
public TransformDecimalToASCII() function should care about space, but
in order to support current semantics without making an extra copy, I
need to do all replacements together.  (And of course I would not want
to publicly expose a function that modifies the string in-place.)

> If they do, I would
> actually welcome to have an official transformation function that
> automatically follows the current preferences of python-dev (or the
> Unicode Consortium).
>

I don't think either group has clearly articulated their preferences
with respect to such algorithm.
msg123166 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-03 02:53
I am submitting a new patch that excludes int() changes.  The honest reason for the exclusion is that I gave up chasing a bug that only shows in full regrtest runs.  (Marc, I don't think it is related to what you thought was a missing norm decref: that place had result = NULL, not return NULL, so the contral was falling through to the decref after the if statement.)  Nevertheless, I think it makes sense to proceed with smaller steps.  As Marc suggested, PyUnicode_EncodeDecimal() API will stay indefinitely, so there is no urge to remove its use.  There are no actual bugs associated with int(), so technically it does not belong to this issue.
msg123218 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-12-03 09:45
Alexander Belopolsky wrote:
> 
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
> 
> On Thu, Dec 2, 2010 at 4:34 PM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
> ..
>>  * Please change the API _PyUnicode_NormalizeDecimal() to
>>   PyUnicode_ConvertToASCIIDecimal() - that's closer to what
>>   it does.
>>
> 
> Are you sure it is a good idea to give it a public name?  I have no
> problem with calling it _PyUnicode_ConvertToASCIIDecimal().
> ("Transform" may be a better term, though.)

Yes, I think it's useful to have as public API.

>>  * Don't have the API remove any whitespace. It should just
>>   work on decimal digit code points (chainging the length
>>   of the Unicode string is a bad idea).
>>
> 
> Yes, that was a bad idea, but the old EncodeDecimal was replacing all
> Unicode space with ASCII ' '.  It will be hard to replicate old
> behavior without doing the same in  ConvertToASCIIDecimal().

Are you sure ? I'm not sure how the underlying PyOS_string_to_double()
(IIRC) works. If it needs ASCII spaces, it's probably OK to just do
the replacement in the constructors. This may have a side-effect
on error reporting, though, e.g. if you convert TABs to single
spaces, the error message will use different formatting than
the original string.

>>  * Please remove the note "This function is no longer used.
>>   Use _PyUnicode_NormalizeDecimal instead." from the
>>   PyUnicode_EncodeDecimal() API description in the
>>   header file. The API won't go away (it does have its
>>   use and is being used in 3rd party extensions) and
>>   you cannot guide people to use a Python private API.
>>
> 
> OK.  I had the same reservations about recommending private API.
> 
>>  * Please double check the ref counts. I think you have a leak
>>   in PyLong_FromUnicode() (for norm) and possible in other
>>   functions as well.
> 
> Will do.  I should also add some more tests for error conditions.  I
> test for leaks, but if the error branch is not covered, it is not
> covered.

Thanks. Error branches often introduce such leaks; which is why
I got used to having an onError: goto target in functions which
then do Py_XDECREF() on all temporary variables I need in the
function.
msg123222 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-12-03 10:02
> Are you sure ? I'm not sure how the underlying PyOS_string_to_double()
> (IIRC) works.

I believe it accepts ASCII whitespace (i.e., chars ' ', '\t', '\f', '\n', '\r', '\v'), and nothing else.
msg123230 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2010-12-03 12:28
According to comments in the code and verified by inspection, PyOS_string_to_double does not accept any whitespace.
msg123240 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-12-03 12:47
> According to comments in the code and verified by inspection,
> PyOS_string_to_double does not accept any whitespace.

Bah.  You're right, of course.  :-)

Any whitespace (post PyUnicode_EncodeDecimal) is handled in PyFloat_FromString, using Py_ISSPACE.
msg123260 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-03 16:41
On Fri, Dec 3, 2010 at 4:45 AM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
..
>> On Thu, Dec 2, 2010 at 4:34 PM, Marc-Andre Lemburg
>> <report@bugs.python.org> wrote:
>> ..
>>>  * Please change the API _PyUnicode_NormalizeDecimal() to
>>>   PyUnicode_ConvertToASCIIDecimal() - that's closer to what
>>>   it does.
>>>
>>
>> Are you sure it is a good idea to give it a public name?  I have no
>> problem with calling it _PyUnicode_ConvertToASCIIDecimal().
>> ("Transform" may be a better term, though.)
>
> Yes, I think it's useful to have as public API.
>

I am afraid this means it has to go in today.    I'll take the white
space processing out of the public method and try to get the patch
ready for commit.
msg123275 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-03 17:54
On Thu, Dec 2, 2010 at 9:53 PM, Alexander Belopolsky
<report@bugs.python.org> wrote:
..
> .. The honest reason for the exclusion is that I gave up chasing a bug that only shows
> in full regrtest runs.

I have realized where the problem was. PyUnicode_FromUnicode()
"helpfully" interns single-character  Unicode objects in the Latin-1
range.  So when TransformDecimal replaces whitespace with ' ', it may
garble cached strings.  I think this optimization is a left-over from
the time when unicode did not have interning, but it is never a good
idea to change immutable objects in-place after creation, so I'll fix
this problem in my code.
msg123315 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-04 03:04
Hopefully this is the last iteration before commit.  As discussed, I took whitespace processing out of PyUnicode_TransformDecimalToASCII() and made it public.  Whitespace conversion in int()/float()/complex() is repetitious and can be optimized by, for example only converting leading and trailing whitespace.  I erred on the side of correctness here and real optimization will come from making conversion algorithms operate directly on Py_UNICODE characters.  This looks like a relatively easy thing to do, but is definitely outside of the scope of this issue.
msg123316 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-04 03:43
Committed in revision 87007.  As a bug fix, this needs to be backported to 3.1, but PyUnicode_TransformDecimalToASCII() should probably be renamed to _PyUnicode_TransformDecimalToASCII() to avoid introduction of a new feature.
msg123335 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-12-04 11:03
Looks okay, I guess.

I don't much like the extra boilerplate that's introduced (and repeated) in longobject.c, floatobject.c and complexobject.c, though.
msg123388 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-04 19:31
On Sat, Dec 4, 2010 at 6:03 AM, Mark Dickinson <report@bugs.python.org> wrote:
..
> I don't much like the extra boilerplate that's introduced (and repeated)
> in longobject.c, floatobject.c and complexobject.c, though.
>

Yes, that's exactly what I meant when I called that code
"repetitious."  Maybe we can tighten this up during the beta period.
What do you think about adding number parsers that operate directly on
Py_UNICODE* strings?
msg123398 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-12-04 20:11
> What do you think about adding number parsers that operate directly on
> Py_UNICODE* strings?

I think that might make some sense.  It's not without difficulties, though.  One issue is that we'd still need the char* -> double operations, partly because PyOS_string_to_double is part of the public API, and partly to continue to support creation of a float from a bytes instance.

The other issue is that for floats, it's difficult to separate the parser from the base conversion;  to be useful, we'd probably end up making the whole of dtoa.c Py_UNICODE aware.  (One of the return values from the dtoa.c parser is a pointer to the significant digits in the original input string;  so the base-conversion calculation itself needs access to portions of the original string.)

Ideally, for float(string), we'd have a zero-copy setup that operated directly on the unicode input (read-only);  but I think that achieving that right now is going to be messy, and involve dtoa.c knowing far more about Unicode that I'd be comfortable with.

N.B. If we didn't have to deal with alternative digits, it *really* would be much simpler.

Perhaps a compromise option is available, that does a preliminary pass on the Unicode string and only makes a copy if non-European digits are discovered.
msg123403 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-04 21:03
On Sat, Dec 4, 2010 at 3:11 PM, Mark Dickinson <report@bugs.python.org> wrote:
>
> Mark Dickinson <dickinsm@gmail.com> added the comment:
>.. One issue is that we'd still need the char* -> double operations, partly because
> PyOS_string_to_double is part of the public API, and partly to continue to support
> creation of a float from a bytes instance.
>

I thought about it.  I see two solutions:

1. Retain PyOS_string_to_double unchanged and add PyOS_unicode_to_double.
2. Replace PyOS_string_to_double with UTF-8 decode result passed to
PyOS_unicode_to_double.

> The other issue is that for floats, it's difficult to separate the parser from the base
> conversion;  to be useful, we'd probably end up making the whole of dtoa.c
> Py_UNICODE aware.

That's what I had in mind.  Naively it looks like we just need to
replace char type with Py_UNICODE in several places.  Assuming exotic
digit conversion is still handled separately.

>  (One of the return values from the dtoa.c parser is a pointer to the significant digits
> in the original input string;  so the base-conversion calculation itself needs access
> to portions of the original string.)
>

Maybe we should start with int().  It is simpler, but probably reveal
some of the same difficulties as float()

> Ideally, for float(string), we'd have a zero-copy setup that operated directly on the
> unicode input (read-only);  but I think that achieving that right now is going to be
> messy, and involve dtoa.c knowing far more about Unicode that I'd be comfortable
> with.
>

This is clearly a 3.3-ish project.  Hopefully in time people will
realize that decimal digits are just [0-9] and numeric experts will
not be required to know about Unicode beyond 127th code point. :-)

> N.B. If we didn't have to deal with alternative digits, it *really* would be much simpler.
>

We still don't.  I've already separated this out and we can keep it
this way as long as people are willing to pay the price for
alternative digits' support.

One thing we may improve, is to fail earlier on non-digits in
PyUnicode_TransformDecimalToASCII()  to speedup not uncommon code like
this:

for line in f:
   try:
       n = int(lint)
   except ValueError:
       pass
   ...

> Perhaps a compromise option is available, that does a preliminary pass on the
> Unicode string and only makes a copy if non-European digits are discovered.

Hmm.  That would require changing the signature of
PyUnicode_TransformDecimalToASCII() to take PyObject* instead of the
buffer.  I knew we shouldn't have rushed to make it public.  We can
still do it in longobject.c and friends' boilerplate.
msg179047 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-04 17:12
Can this issue be closed?
msg181455 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2013-02-05 16:26
Sure, this can be closed as far as I'm concerned.
msg181824 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2013-02-10 18:12
Closing.
History
Date User Action Args
2013-02-10 18:12:02mark.dickinsonsetstatus: pending -> closed

messages: + msg181824
2013-02-05 16:26:22mark.dickinsonsetstatus: open -> pending
2013-02-05 16:26:14mark.dickinsonsetstatus: pending -> open

messages: + msg181455
2013-02-05 14:38:55serhiy.storchakasetstatus: open -> pending
2013-01-04 17:12:12serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg179047
2011-11-21 20:17:02vstinnersetversions: + Python 3.3, - Python 3.1
2010-12-20 18:52:10belopolskysetnosy: lemburg, mark.dickinson, belopolsky, eric.smith, ezio.melotti, skrah
versions: + Python 3.1
2010-12-04 21:03:18belopolskysetmessages: + msg123403
2010-12-04 20:11:53mark.dickinsonsetmessages: + msg123398
2010-12-04 19:31:47belopolskysetmessages: + msg123388
2010-12-04 11:03:16mark.dickinsonsetmessages: + msg123335
2010-12-04 03:43:18belopolskysetresolution: fixed
messages: + msg123316
2010-12-04 03:04:57belopolskysetfiles: + issue10557d.diff

messages: + msg123315
2010-12-03 17:54:26belopolskysetmessages: + msg123275
2010-12-03 16:41:42belopolskysetmessages: + msg123260
2010-12-03 12:47:46mark.dickinsonsetmessages: + msg123240
2010-12-03 12:28:31eric.smithsetmessages: + msg123230
2010-12-03 10:02:23mark.dickinsonsetmessages: + msg123222
2010-12-03 09:57:35vstinnersetnosy: - vstinner
2010-12-03 09:45:02lemburgsetmessages: + msg123218
2010-12-03 02:53:36belopolskysetfiles: + issue10557c.diff

messages: + msg123166
2010-12-03 00:07:32belopolskysetmessages: + msg123147
2010-12-02 23:32:01skrahsetmessages: + msg123144
2010-12-02 22:08:48belopolskysetmessages: + msg123133
2010-12-02 21:34:42lemburgsetmessages: + msg123123
2010-12-02 19:20:05mark.dickinsonsetmessages: + msg123103
2010-12-02 18:28:56belopolskysetmessages: + msg123095
2010-12-02 18:28:15belopolskysetmessages: + msg123094
2010-12-02 18:08:28mark.dickinsonsetmessages: + msg123091
2010-12-02 17:54:27mark.dickinsonsetmessages: + msg123088
2010-12-02 17:38:56belopolskysetfiles: + issue10557b.diff
2010-12-02 17:38:24belopolskysetassignee: belopolsky
messages: + msg123087
stage: test needed -> commit review
2010-11-29 15:39:49belopolskysetfiles: + issue10557a.diff

messages: + msg122816
2010-11-29 09:41:33lemburgsetmessages: + msg122785
2010-11-29 06:23:07belopolskysetfiles: + issue10557.diff

messages: + msg122775
2010-11-28 20:38:25belopolskysetmessages: - msg122725
2010-11-28 20:38:11belopolskysetmessages: + msg122726
2010-11-28 20:36:59belopolskysetmessages: + msg122725
2010-11-28 20:30:39skrahsetmessages: + msg122723
2010-11-28 20:22:01belopolskysetmessages: + msg122719
2010-11-28 20:09:12skrahsetmessages: + msg122716
2010-11-28 18:37:40belopolskysetmessages: + msg122698
2010-11-28 18:09:20lemburgsetmessages: + msg122693
2010-11-28 18:01:45lemburgsetmessages: + msg122691
2010-11-28 17:48:45skrahsetnosy: + skrah
messages: + msg122689
2010-11-28 17:34:03belopolskysetmessages: + msg122688
2010-11-28 17:33:45mark.dickinsonsetmessages: + msg122687
2010-11-28 17:24:19belopolskysetmessages: + msg122684
2010-11-28 17:00:08mark.dickinsonsetnosy: + lemburg
messages: + msg122683
2010-11-28 16:36:58mark.dickinsonsetmessages: + msg122679
2010-11-28 10:49:11eric.smithsetnosy: + eric.smith
2010-11-28 05:37:13ezio.melottisetmessages: + msg122634
2010-11-28 04:29:32ezio.melottisetmessages: + msg122625
2010-11-28 03:22:23belopolskycreate