Issue10557
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2010-11-28 03:22 by belopolsky, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Files | ||||
---|---|---|---|---|
File name | Uploaded | Description | Edit | |
float-error.diff | belopolsky, 2010-11-28 03:22 | review | ||
issue10557.diff | belopolsky, 2010-11-29 06:23 | review | ||
issue10557a.diff | belopolsky, 2010-11-29 15:39 | review | ||
issue10557b.diff | belopolsky, 2010-12-02 17:38 | review | ||
issue10557c.diff | belopolsky, 2010-12-03 02:53 | review | ||
issue10557d.diff | belopolsky, 2010-12-04 03:04 | review |
Messages (45) | |||
---|---|---|---|
msg122612 - (view) | Author: Alexander Belopolsky (belopolsky) * ![]() |
Date: 2010-11-28 03:22 | |
>>> float('½') Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: could not convert string to float: � >>> float('42½') Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError With the attached patch, float-error.diff >>> float('½') Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: invalid literal for float(): ½ >>> float('42½') Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: invalid literal for float(): 42½ Note that the proposed patch also has an effect of disallowing non-ascii digits in float() constructor. Before the patch: >>> float('١٢٣٤.٥٦') 1234.56 After the patch: >>> float('١٢٣٤.٥٦') Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: could not convert string to float: ١٢٣٤.٥٦ I am not sure, whether support for non-ascii digits in float() constructor is worth maintaining. (Anyone knows whether Arabic numbers are written right to left or left to right? What is the proper decimal point character?) Also, I don't think users expect UnicodeEncodeError from float() or int(). Before the patch: >>> float('\uffff') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'decimal' codec can't encode character '\uffff' in position 0: invalid decimal Unicode string After the patch: >>> float('\uffff') Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: could not convert string to float: |
|||
msg122625 - (view) | Author: Ezio Melotti (ezio.melotti) * ![]() |
Date: 2010-11-28 04:29 | |
I think float() should support non-ascii digits but I agree that it would be better to avoid UnicodeErrors and convert them to ValueErrors so that >>> float('١٢٣٤.٥٦') 1234.56 and >>> float('½') Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: invalid literal for float(): ½ I.e. float should do the C equivalent of: try: s = arg.encode('decimal') except UnicodeEncodeError: raise ValueError('Invalid liter for float(): {}'.format(arg)) Note that int() and Decimal() supports non-ascii chars too. |
|||
msg122634 - (view) | Author: Ezio Melotti (ezio.melotti) * ![]() |
Date: 2010-11-28 05:37 | |
FWIW the UnicodeError comes from PyUnicode_EncodeDecimal (unicodeobject.c:6212) and the "ValueError: could not convert string to float" with the buggy � comes from PyOS_string_to_double (pystrtod.c:316). Maybe PyOS_string_to_double should be fixed to display the string correctly, but I don't know what will be affected from that (and if it will make things worse or better). The UnicodeError can be fixed in PyFloat_FromString (floatobject.c:174). |
|||
msg122679 - (view) | Author: Mark Dickinson (mark.dickinson) * ![]() |
Date: 2010-11-28 16:36 | |
> I am not sure, whether support for non-ascii digits in float() > constructor is worth maintaining. I'd be very happy to drop such support. If you allow alternative digit sets, trying to work out exactly what should be supported and what shouldn't gets very messy; it's even worse for int(), where bases > 10 have to be taken into account. > (Anyone knows whether Arabic numbers are written right to left > or left to right? What is the proper decimal point character?) Well, judging by the chocolate packaging I saw recently, they're written left to right (so presumably if you're reading right-to-left, you see the units first, then the tens, etc., which always struck me as the 'right' way to write things in the first place :-). No idea about the proper decimal point character, though. |
|||
msg122683 - (view) | Author: Mark Dickinson (mark.dickinson) * ![]() |
Date: 2010-11-28 17:00 | |
About Alexander's solution: might it make more sense to have PyUnicode_EncodeDecimal raise for inputs like this? I see it as PyUnicode_EncodeDecimal's job to turn the unicode input into usable ASCII (or raise an exception); it looks like that's not happening here. Adding MAL to the nosy in case he wants to comment on this. |
|||
msg122684 - (view) | Author: Alexander Belopolsky (belopolsky) * ![]() |
Date: 2010-11-28 17:24 | |
On Sun, Nov 28, 2010 at 11:37 AM, Mark Dickinson <report@bugs.python.org> wrote: > > Mark Dickinson <dickinsm@gmail.com> added the comment: > >> I am not sure, whether support for non-ascii digits in float() >> constructor is worth maintaining. > > I'd be very happy to drop such support. If you allow alternative digit sets, trying to work out exactly what should > be supported and what shouldn't gets very messy; it's even worse for int(), where bases > 10 have to be taken > into account. > I think there is a proposal (if not already in Unicode 6.0) to add hexadecimal value property to UCD, but I don't think it is right for int() and float() to depend on UCD in the first place. If people need to process exotic decimals, we can expose 'decimal' encoding, however, translating from code-point to decimal value is trivial - just subtract the code-point value for zero. >> (Anyone knows whether Arabic numbers are written right to left >> or left to right? What is the proper decimal point character?) > > Well, judging by the chocolate packaging I saw recently, they're written left to right > (so presumably if you're reading right-to-left, you see the units first, then the tens, etc., > which always struck me as the 'right' way to write things in the first place :-). Did your chocolate packaging use European digits or Arabic-Indic ones? Note that they have different bidi properties: 'EN' >>> unicodedata.bidirectional('\N{ARABIC-INDIC DIGIT ONE}') 'AN' (I am not sure what 'AN' (Arabic Numeral?) bidi property actually mean.) |
|||
msg122687 - (view) | Author: Mark Dickinson (mark.dickinson) * ![]() |
Date: 2010-11-28 17:33 | |
> Did your chocolate packaging use European digits or Arabic-Indic ones? > Note that they have different bidi properties: Good question; I think it was Arabic-Indic digits, but to be honest I don't remember. (It wasn't *all* that recently.) |
|||
msg122688 - (view) | Author: Alexander Belopolsky (belopolsky) * ![]() |
Date: 2010-11-28 17:34 | |
On Sun, Nov 28, 2010 at 12:00 PM, Mark Dickinson <report@bugs.python.org> wrote: > > Mark Dickinson <dickinsm@gmail.com> added the comment: > > About Alexander's solution: might it make more sense to have PyUnicode_EncodeDecimal raise > for inputs like this? No, I think PyOS_string_to_double() can generate better error messages than PyUnicode_EncodeDecimal. It is important to pass losslessly encoded string to PyOS_string_to_double() for proper error reporting. Otherwise, we will have to catch the error in PyFloat_FromString() just to add the string value to the message and may loose other information such as the precise location of the offending character. (AFAICT, we don't make use of it now, but this would be a meaningful improvement.) > I see it as PyUnicode_EncodeDecimal's job to turn the unicode input into usable ASCII > (or raise an exception); it looks like that's not happening here. UTF-8 is quite usable by PyOS_string_to_double() . UTF-8 encoder is extremely fast and will only get faster over time. In my opinion, PyUnicode_EncodeDecimal() is either unnecessary or should be exposed as a codec. |
|||
msg122689 - (view) | Author: Stefan Krah (skrah) * ![]() |
Date: 2010-11-28 17:48 | |
> PyUnicode_EncodeDecimal() is either unnecessary or should be exposed > as a codec. I'm depending on PyUnicode_EncodeDecimal in cdecimal. In fact, it saved me quite a bit of trouble. I wouldn't be surprised if other extension writers use it as well. |
|||
msg122691 - (view) | Author: Marc-Andre Lemburg (lemburg) * ![]() |
Date: 2010-11-28 18:01 | |
Mark Dickinson wrote: > > Mark Dickinson <dickinsm@gmail.com> added the comment: > > About Alexander's solution: might it make more sense to have PyUnicode_EncodeDecimal raise for inputs like this? I see it as PyUnicode_EncodeDecimal's job to turn the unicode input into usable ASCII (or raise an exception); it looks like that's not happening here. > > Adding MAL to the nosy in case he wants to comment on this. The purpose of the PyUnicode_EncodeDecimal() API is to convert Unicode decimal digits to ASCII digits for interpretation by the other usual functions to convert the ASCII representation to numbers. The proposed patch will not work, since it removes the support for non-ASCII number code points, e.g. Asian number code points. You might want to replace the error message by something more related to floats in the float constructor. Note that UnicodeErrors are subclasses of ValueErrors, so the errors are not unexpected for number constructors. |
|||
msg122693 - (view) | Author: Marc-Andre Lemburg (lemburg) * ![]() |
Date: 2010-11-28 18:09 | |
>>>> float('½') > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > ValueError: could not convert string to float: � > >>>> float('42½') > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > ValueError Note that fractional Unicode code points are not supported by the encoding function. Neither are code points which do not evaluate to 0-9, e.g. ones that represent numbers larger than 9. |
|||
msg122698 - (view) | Author: Alexander Belopolsky (belopolsky) * ![]() |
Date: 2010-11-28 18:37 | |
Issue #10567 demonstrated the problem of relying on the Unicode database in Python builtins. Apparently, Unicode does not guarantee stability of the character categories. On the other hand, we are already tied to UCD for the language definition. Maybe Python should document the version of Unicode it is using in any given version and possibly upgrade to Unicode 6.0 should be postponed until the language moratorium is lifted. |
|||
msg122716 - (view) | Author: Stefan Krah (skrah) * ![]() |
Date: 2010-11-28 20:09 | |
> (Anyone knows whether Arabic numbers are written right to left or left > to right? What is the proper decimal point character?) Thousands separator: U+066C Decimal point: U+066B ١٢٣٬١٢٣٫١٢ should be: 123,123.12 Wikipedia says that digits are arranged in the usual way, lowest value digit to the right. |
|||
msg122719 - (view) | Author: Alexander Belopolsky (belopolsky) * ![]() |
Date: 2010-11-28 20:22 | |
On Sun, Nov 28, 2010 at 3:09 PM, Stefan Krah <report@bugs.python.org> wrote: .. > Decimal point: U+066B Well, not so fast: Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'decimal' codec can't encode character '\u066b' in position 1: invalid decimal Unicode string |
|||
msg122723 - (view) | Author: Stefan Krah (skrah) * ![]() |
Date: 2010-11-28 20:30 | |
> UnicodeEncodeError: 'decimal' codec can't encode character '\u066b' Hmm, looks like a bug? I think U+066B is correct. |
|||
msg122726 - (view) | Author: Alexander Belopolsky (belopolsky) * ![]() |
Date: 2010-11-28 20:38 | |
Sending this by e-mail was not a good idea ... On Sun, Nov 28, 2010 at 3:30 PM, Stefan Krah <report@bugs.python.org> wrote: .. >> UnicodeEncodeError: 'decimal' codec can't encode character '\u066b' > > Hmm, looks like a bug? I think U+066B is correct. > Really? What about >>> float('1234.56') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'decimal' codec can't encode character '\uff0e' in position 4: invalid decimal Unicode string .. and where do we draw the line? Note that I am not against Decimal() accepting any c with c.isdigit() returning True, but builtins should be less promiscuous IMO. |
|||
msg122775 - (view) | Author: Alexander Belopolsky (belopolsky) * ![]() |
Date: 2010-11-29 06:23 | |
After a bit of svn archeology, it does appear that Arabic-Indic digits' support was deliberate at least in the sense that the feature was tested for when the code was first committed. See r15000. The test migrated from file to file over the last 10 years, but it is still present in test_float.py: self.assertEqual(float(b" \u0663.\u0661\u0664 ".decode('raw-unicode-escape')), 3.14) (It should probably be now rewritten using a string literal.) I am now attaching the patch (issue10557.diff) that fixes the bug without sacrificing non-ASCII digit support. If this approach is well-received, I would like to replace all calls to PyUnicode_EncodeDecimal() with the calls to the new _PyUnicode_EncodeDecimalUTF8() and deprecate Latin-1-oriented PyUnicode_EncodeDecimal(). For the future, I note that starting with Unicode 6.0.0, the Unicode Consortium promises that """ Characters with the property value Numeric_Type=de (Decimal) only occur in contiguous ranges of 10 characters, with ascending numeric values from 0 to 9 (Numeric_Value=0..9). """ This makes it very easy to check a numeric string does not contain a mix of digits from different scripts. I still believe that proper API should require explicit choice of language or locale before allowing digits other than 0-9 just as int() would not accept hexadecimal digits without explicit choice of base >= 16. But this would be a subject of a feature request. |
|||
msg122785 - (view) | Author: Marc-Andre Lemburg (lemburg) * ![]() |
Date: 2010-11-29 09:41 | |
Alexander Belopolsky wrote: > > Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment: > > After a bit of svn archeology, it does appear that Arabic-Indic digits' support was deliberate at least in the sense that the feature was tested for when the code was first committed. See r15000. As I mentioned on python-dev (http://mail.python.org/pipermail/python-dev/2010-November/106077.html) this support was added intentionally. > The test migrated from file to file over the last 10 years, but it is still present in test_float.py: > > self.assertEqual(float(b" \u0663.\u0661\u0664 ".decode('raw-unicode-escape')), 3.14) > > (It should probably be now rewritten using a string literal.) > > I am now attaching the patch (issue10557.diff) that fixes the bug without sacrificing non-ASCII digit support. > If this approach is well-received, I would like to replace all calls to PyUnicode_EncodeDecimal() with the calls to the new _PyUnicode_EncodeDecimalUTF8() and deprecate Latin-1-oriented PyUnicode_EncodeDecimal(). It would be better to copy and iterate over the Unicode string first, replacing any decimal code points with ASCII ones and then call the UTF-8 encoder. The code as it stands is very inefficient, since it will most likely run the memcpy() part for every code point after the first non-ASCII decimal one. > For the future, I note that starting with Unicode 6.0.0, the Unicode Consortium promises that > > """ > Characters with the property value Numeric_Type=de (Decimal) only occur in contiguous ranges of 10 characters, with ascending numeric values from 0 to 9 (Numeric_Value=0..9). > """ > > This makes it very easy to check a numeric string does not contain a mix of digits from different scripts. I'm not sure why you'd want to check for such ranges. > I still believe that proper API should require explicit choice of language or locale before allowing digits other than 0-9 just as int() would not accept hexadecimal digits without explicit choice of base >= 16. But this would be a subject of a feature request. Since when do we require a locale or language to be specified when using Unicode ? The codecs, Unicode methods and other Unicode support features happily work with all kinds of languages, mixed or not, without any such specification. |
|||
msg122816 - (view) | Author: Alexander Belopolsky (belopolsky) * ![]() |
Date: 2010-11-29 15:39 | |
On Mon, Nov 29, 2010 at 4:41 AM, Marc-Andre Lemburg <report@bugs.python.org> wrote: .. > It would be better to copy and iterate over the Unicode string first, > replacing any decimal code points with ASCII ones and then call the > UTF-8 encoder. > Good idea. > The code as it stands is very inefficient, since it will most likely > run the memcpy() part for every code point after the first non-ASCII > decimal one. > I doubt there are measurable gains from this optimization, but doing conversion in Unicode characters results in cleaner API. The new patch, issue10557a.diff, implements _PyUnicode_NormalizeDecimal(Py_UNICODE *s, Py_ssize_t length) which is defined as follows: /* Strip leading and trailing space and convert code points that have decimal digit property to the corresponding ASCII digit code point. Returns a new Unicode string on success, NULL on failure. */ Note that I used deprecated _PyUnicode_AsStringAndSize() in floatobject.c not only because it is convenient, but also because I believe that in the future numerical value parsers should be converted to operate on unicode characters. When this happens, the use of _PyUnicode_AsStringAndSize() can be removed. |
|||
msg123087 - (view) | Author: Alexander Belopolsky (belopolsky) * ![]() |
Date: 2010-12-02 17:38 | |
I am submitting a patch (issue10557b.diff) for commit review. As Marc suggested, decimal conversion is now performed on Py_UNICODE characters. For this purpose, I introduced _PyUnicode_NormalizeDecimal() function that takes Py_UNICODE and returns a PyUnicode object with whitespace stripped and non-ASCII digits converted to ASCII equivalents. The PyUnicode_EncodeDecimal() function is no longer used and I added a comment recommending that _PyUnicode_NormalizeDecimal() be used instead. I would like to eventually remove PyUnicode_EncodeDecimal(), but I amd not sure about the proper deprecation procedures for undocumented C APIs. As a result, int(), float(), etc will no longer raise UnicodeDecodeError unless given a string with lone surrogates. (This error comes from UTF-8 codec that is applied after digit normalization.) A few error cases such as embedded '\0' and non-digit characters with ord(c) > 255 will now raise ValueError instead of UnicodeDecodeError. Since UnicodeDecodeError is a subclass of ValueError, it is unlikely that existing code would attempt to differentiate between the two. It is possible to achieve complete compatibility, but it is hard to justify reporting different error types on non-digit characters below and above code point 255. The patch contains tests for error messages that I tried to make robust by only requiring that s.strip() be found somewhere in the error message from int(s). Note that since in this patch whitespace is stripped before the string is passed to the parser, the parser errors do not contain the whitespace. This may actually be desirable because it helps the user to see the source of the error without being distracted by irrelevant white space. |
|||
msg123088 - (view) | Author: Mark Dickinson (mark.dickinson) * ![]() |
Date: 2010-12-02 17:54 | |
Is the stripping of whitespace necessary for this fix? Currently, the complex constructor accepts whitespace both inside and outside the (optional) parentheses: >>> complex(' ( 2+3j ) ') (2+3j) The classes of whitespace accepted in each position are the same. IIUC, with your patch, that consistency would be lost---is that right? If the whitespace stripping isn't necessary then I'd prefer to leave that change for another issue. |
|||
msg123091 - (view) | Author: Mark Dickinson (mark.dickinson) * ![]() |
Date: 2010-12-02 18:08 | |
Just to clarify: I'm not opposed to allowing arbitrary Unicode whitespace in the float, int, complex constructors (indeed, it's probably a good thing). But I'd like to see the change made consistently; for the complex constructor this looks a bit involved, so it would probably be cleaner to have a separate patch to make the whitespace changes. |
|||
msg123094 - (view) | Author: Alexander Belopolsky (belopolsky) * ![]() |
Date: 2010-12-02 18:28 | |
On Thu, Dec 2, 2010 at 12:54 PM, Mark Dickinson <report@bugs.python.org> wrote: .. > The classes of whitespace accepted in each position are the same. IIUC, with your patch, > that consistency would be lost---is that right? Good point. I thought The PyUnicode_EncodeDecimal() was stripping the space, but it was converting it to ASCII ' ' instead. That's easy to fix. Can you suggest a test case? |
|||
msg123095 - (view) | Author: Alexander Belopolsky (belopolsky) * ![]() |
Date: 2010-12-02 18:28 | |
On Thu, Dec 2, 2010 at 1:28 PM, Alexander Belopolsky <belopolsky@users.sourceforge.net> wrote: .. > Can you suggest a test case? I mean for complex(). |
|||
msg123103 - (view) | Author: Mark Dickinson (mark.dickinson) * ![]() |
Date: 2010-12-02 19:20 | |
Ah yes, you're right: this shouldn't be a hard fix. I withdraw my suggestion for a separate patch. :-) Checking that: complex('\xa0(\xa02+3j\xa0)\xa0') == complex(2.0, 3.0) would probably be enough. |
|||
msg123123 - (view) | Author: Marc-Andre Lemburg (lemburg) * ![]() |
Date: 2010-12-02 21:34 | |
Alexander Belopolsky wrote: > > Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment: > > I am submitting a patch (issue10557b.diff) for commit review. As Marc suggested, decimal conversion is now performed on Py_UNICODE characters. For this purpose, I introduced _PyUnicode_NormalizeDecimal() function that takes Py_UNICODE and returns a PyUnicode object with whitespace stripped and non-ASCII digits converted to ASCII equivalents. The PyUnicode_EncodeDecimal() function is no longer used and I added a comment recommending that _PyUnicode_NormalizeDecimal() be used instead. I would like to eventually remove PyUnicode_EncodeDecimal(), but I amd not sure about the proper deprecation procedures for undocumented C APIs. > > As a result, int(), float(), etc will no longer raise UnicodeDecodeError unless given a string with lone surrogates. (This error comes from UTF-8 codec that is applied after digit normalization.) > > A few error cases such as embedded '\0' and non-digit characters with ord(c) > 255 will now raise ValueError instead of UnicodeDecodeError. Since UnicodeDecodeError is a subclass of ValueError, it is unlikely that existing code would attempt to differentiate between the two. It is possible to achieve complete compatibility, but it is hard to justify reporting different error types on non-digit characters below and above code point 255. > > The patch contains tests for error messages that I tried to make robust by only requiring that s.strip() be found somewhere in the error message from int(s). Note that since in this patch whitespace is stripped before the string is passed to the parser, the parser errors do not contain the whitespace. This may actually be desirable because it helps the user to see the source of the error without being distracted by irrelevant white space. Thanks for the patch. I've had a quick look... Some comments: * Please change the API _PyUnicode_NormalizeDecimal() to PyUnicode_ConvertToASCIIDecimal() - that's closer to what it does. * Don't have the API remove any whitespace. It should just work on decimal digit code points (chainging the length of the Unicode string is a bad idea). * Please remove the note "This function is no longer used. Use _PyUnicode_NormalizeDecimal instead." from the PyUnicode_EncodeDecimal() API description in the header file. The API won't go away (it does have its use and is being used in 3rd party extensions) and you cannot guide people to use a Python private API. * Please double check the ref counts. I think you have a leak in PyLong_FromUnicode() (for norm) and possible in other functions as well. |
|||
msg123133 - (view) | Author: Alexander Belopolsky (belopolsky) * ![]() |
Date: 2010-12-02 22:08 | |
On Thu, Dec 2, 2010 at 4:34 PM, Marc-Andre Lemburg <report@bugs.python.org> wrote: .. > * Please change the API _PyUnicode_NormalizeDecimal() to > PyUnicode_ConvertToASCIIDecimal() - that's closer to what > it does. > Are you sure it is a good idea to give it a public name? I have no problem with calling it _PyUnicode_ConvertToASCIIDecimal(). ("Transform" may be a better term, though.) > * Don't have the API remove any whitespace. It should just > work on decimal digit code points (chainging the length > of the Unicode string is a bad idea). > Yes, that was a bad idea, but the old EncodeDecimal was replacing all Unicode space with ASCII ' '. It will be hard to replicate old behavior without doing the same in ConvertToASCIIDecimal(). > * Please remove the note "This function is no longer used. > Use _PyUnicode_NormalizeDecimal instead." from the > PyUnicode_EncodeDecimal() API description in the > header file. The API won't go away (it does have its > use and is being used in 3rd party extensions) and > you cannot guide people to use a Python private API. > OK. I had the same reservations about recommending private API. > * Please double check the ref counts. I think you have a leak > in PyLong_FromUnicode() (for norm) and possible in other > functions as well. Will do. I should also add some more tests for error conditions. I test for leaks, but if the error branch is not covered, it is not covered. |
|||
msg123144 - (view) | Author: Stefan Krah (skrah) * ![]() |
Date: 2010-12-02 23:32 | |
Alexander Belopolsky <report@bugs.python.org> wrote: > On Thu, Dec 2, 2010 at 4:34 PM, Marc-Andre Lemburg > <report@bugs.python.org> wrote: > .. > > * Please change the API _PyUnicode_NormalizeDecimal() to > > PyUnicode_ConvertToASCIIDecimal() - that's closer to what > > it does. > > > > Are you sure it is a good idea to give it a public name? I have no > problem with calling it _PyUnicode_ConvertToASCIIDecimal(). > ("Transform" may be a better term, though.) I like the public name. Extension authors can use it and be sure that their programs accept exactly the same numeric strings as the rest of Python. Are you worried that the semantics might change? If they do, I would actually welcome to have an official transformation function that automatically follows the current preferences of python-dev (or the Unicode Consortium). |
|||
msg123147 - (view) | Author: Alexander Belopolsky (belopolsky) * ![]() |
Date: 2010-12-03 00:07 | |
On Thu, Dec 2, 2010 at 6:32 PM, Stefan Krah <report@bugs.python.org> wrote: .. > I like the public name. Extension authors can use it and be sure that > their programs accept exactly the same numeric strings as the rest of > Python. > > Are you worried that the semantics might change? Yes, I am already working on a change that instead of stripping whitespace will replace it with ASCI space. I don't think that a public TransformDecimalToASCII() function should care about space, but in order to support current semantics without making an extra copy, I need to do all replacements together. (And of course I would not want to publicly expose a function that modifies the string in-place.) > If they do, I would > actually welcome to have an official transformation function that > automatically follows the current preferences of python-dev (or the > Unicode Consortium). > I don't think either group has clearly articulated their preferences with respect to such algorithm. |
|||
msg123166 - (view) | Author: Alexander Belopolsky (belopolsky) * ![]() |
Date: 2010-12-03 02:53 | |
I am submitting a new patch that excludes int() changes. The honest reason for the exclusion is that I gave up chasing a bug that only shows in full regrtest runs. (Marc, I don't think it is related to what you thought was a missing norm decref: that place had result = NULL, not return NULL, so the contral was falling through to the decref after the if statement.) Nevertheless, I think it makes sense to proceed with smaller steps. As Marc suggested, PyUnicode_EncodeDecimal() API will stay indefinitely, so there is no urge to remove its use. There are no actual bugs associated with int(), so technically it does not belong to this issue. |
|||
msg123218 - (view) | Author: Marc-Andre Lemburg (lemburg) * ![]() |
Date: 2010-12-03 09:45 | |
Alexander Belopolsky wrote: > > Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment: > > On Thu, Dec 2, 2010 at 4:34 PM, Marc-Andre Lemburg > <report@bugs.python.org> wrote: > .. >> * Please change the API _PyUnicode_NormalizeDecimal() to >> PyUnicode_ConvertToASCIIDecimal() - that's closer to what >> it does. >> > > Are you sure it is a good idea to give it a public name? I have no > problem with calling it _PyUnicode_ConvertToASCIIDecimal(). > ("Transform" may be a better term, though.) Yes, I think it's useful to have as public API. >> * Don't have the API remove any whitespace. It should just >> work on decimal digit code points (chainging the length >> of the Unicode string is a bad idea). >> > > Yes, that was a bad idea, but the old EncodeDecimal was replacing all > Unicode space with ASCII ' '. It will be hard to replicate old > behavior without doing the same in ConvertToASCIIDecimal(). Are you sure ? I'm not sure how the underlying PyOS_string_to_double() (IIRC) works. If it needs ASCII spaces, it's probably OK to just do the replacement in the constructors. This may have a side-effect on error reporting, though, e.g. if you convert TABs to single spaces, the error message will use different formatting than the original string. >> * Please remove the note "This function is no longer used. >> Use _PyUnicode_NormalizeDecimal instead." from the >> PyUnicode_EncodeDecimal() API description in the >> header file. The API won't go away (it does have its >> use and is being used in 3rd party extensions) and >> you cannot guide people to use a Python private API. >> > > OK. I had the same reservations about recommending private API. > >> * Please double check the ref counts. I think you have a leak >> in PyLong_FromUnicode() (for norm) and possible in other >> functions as well. > > Will do. I should also add some more tests for error conditions. I > test for leaks, but if the error branch is not covered, it is not > covered. Thanks. Error branches often introduce such leaks; which is why I got used to having an onError: goto target in functions which then do Py_XDECREF() on all temporary variables I need in the function. |
|||
msg123222 - (view) | Author: Mark Dickinson (mark.dickinson) * ![]() |
Date: 2010-12-03 10:02 | |
> Are you sure ? I'm not sure how the underlying PyOS_string_to_double() > (IIRC) works. I believe it accepts ASCII whitespace (i.e., chars ' ', '\t', '\f', '\n', '\r', '\v'), and nothing else. |
|||
msg123230 - (view) | Author: Eric V. Smith (eric.smith) * ![]() |
Date: 2010-12-03 12:28 | |
According to comments in the code and verified by inspection, PyOS_string_to_double does not accept any whitespace. |
|||
msg123240 - (view) | Author: Mark Dickinson (mark.dickinson) * ![]() |
Date: 2010-12-03 12:47 | |
> According to comments in the code and verified by inspection, > PyOS_string_to_double does not accept any whitespace. Bah. You're right, of course. :-) Any whitespace (post PyUnicode_EncodeDecimal) is handled in PyFloat_FromString, using Py_ISSPACE. |
|||
msg123260 - (view) | Author: Alexander Belopolsky (belopolsky) * ![]() |
Date: 2010-12-03 16:41 | |
On Fri, Dec 3, 2010 at 4:45 AM, Marc-Andre Lemburg <report@bugs.python.org> wrote: .. >> On Thu, Dec 2, 2010 at 4:34 PM, Marc-Andre Lemburg >> <report@bugs.python.org> wrote: >> .. >>> * Please change the API _PyUnicode_NormalizeDecimal() to >>> PyUnicode_ConvertToASCIIDecimal() - that's closer to what >>> it does. >>> >> >> Are you sure it is a good idea to give it a public name? I have no >> problem with calling it _PyUnicode_ConvertToASCIIDecimal(). >> ("Transform" may be a better term, though.) > > Yes, I think it's useful to have as public API. > I am afraid this means it has to go in today. I'll take the white space processing out of the public method and try to get the patch ready for commit. |
|||
msg123275 - (view) | Author: Alexander Belopolsky (belopolsky) * ![]() |
Date: 2010-12-03 17:54 | |
On Thu, Dec 2, 2010 at 9:53 PM, Alexander Belopolsky <report@bugs.python.org> wrote: .. > .. The honest reason for the exclusion is that I gave up chasing a bug that only shows > in full regrtest runs. I have realized where the problem was. PyUnicode_FromUnicode() "helpfully" interns single-character Unicode objects in the Latin-1 range. So when TransformDecimal replaces whitespace with ' ', it may garble cached strings. I think this optimization is a left-over from the time when unicode did not have interning, but it is never a good idea to change immutable objects in-place after creation, so I'll fix this problem in my code. |
|||
msg123315 - (view) | Author: Alexander Belopolsky (belopolsky) * ![]() |
Date: 2010-12-04 03:04 | |
Hopefully this is the last iteration before commit. As discussed, I took whitespace processing out of PyUnicode_TransformDecimalToASCII() and made it public. Whitespace conversion in int()/float()/complex() is repetitious and can be optimized by, for example only converting leading and trailing whitespace. I erred on the side of correctness here and real optimization will come from making conversion algorithms operate directly on Py_UNICODE characters. This looks like a relatively easy thing to do, but is definitely outside of the scope of this issue. |
|||
msg123316 - (view) | Author: Alexander Belopolsky (belopolsky) * ![]() |
Date: 2010-12-04 03:43 | |
Committed in revision 87007. As a bug fix, this needs to be backported to 3.1, but PyUnicode_TransformDecimalToASCII() should probably be renamed to _PyUnicode_TransformDecimalToASCII() to avoid introduction of a new feature. |
|||
msg123335 - (view) | Author: Mark Dickinson (mark.dickinson) * ![]() |
Date: 2010-12-04 11:03 | |
Looks okay, I guess. I don't much like the extra boilerplate that's introduced (and repeated) in longobject.c, floatobject.c and complexobject.c, though. |
|||
msg123388 - (view) | Author: Alexander Belopolsky (belopolsky) * ![]() |
Date: 2010-12-04 19:31 | |
On Sat, Dec 4, 2010 at 6:03 AM, Mark Dickinson <report@bugs.python.org> wrote: .. > I don't much like the extra boilerplate that's introduced (and repeated) > in longobject.c, floatobject.c and complexobject.c, though. > Yes, that's exactly what I meant when I called that code "repetitious." Maybe we can tighten this up during the beta period. What do you think about adding number parsers that operate directly on Py_UNICODE* strings? |
|||
msg123398 - (view) | Author: Mark Dickinson (mark.dickinson) * ![]() |
Date: 2010-12-04 20:11 | |
> What do you think about adding number parsers that operate directly on > Py_UNICODE* strings? I think that might make some sense. It's not without difficulties, though. One issue is that we'd still need the char* -> double operations, partly because PyOS_string_to_double is part of the public API, and partly to continue to support creation of a float from a bytes instance. The other issue is that for floats, it's difficult to separate the parser from the base conversion; to be useful, we'd probably end up making the whole of dtoa.c Py_UNICODE aware. (One of the return values from the dtoa.c parser is a pointer to the significant digits in the original input string; so the base-conversion calculation itself needs access to portions of the original string.) Ideally, for float(string), we'd have a zero-copy setup that operated directly on the unicode input (read-only); but I think that achieving that right now is going to be messy, and involve dtoa.c knowing far more about Unicode that I'd be comfortable with. N.B. If we didn't have to deal with alternative digits, it *really* would be much simpler. Perhaps a compromise option is available, that does a preliminary pass on the Unicode string and only makes a copy if non-European digits are discovered. |
|||
msg123403 - (view) | Author: Alexander Belopolsky (belopolsky) * ![]() |
Date: 2010-12-04 21:03 | |
On Sat, Dec 4, 2010 at 3:11 PM, Mark Dickinson <report@bugs.python.org> wrote: > > Mark Dickinson <dickinsm@gmail.com> added the comment: >.. One issue is that we'd still need the char* -> double operations, partly because > PyOS_string_to_double is part of the public API, and partly to continue to support > creation of a float from a bytes instance. > I thought about it. I see two solutions: 1. Retain PyOS_string_to_double unchanged and add PyOS_unicode_to_double. 2. Replace PyOS_string_to_double with UTF-8 decode result passed to PyOS_unicode_to_double. > The other issue is that for floats, it's difficult to separate the parser from the base > conversion; to be useful, we'd probably end up making the whole of dtoa.c > Py_UNICODE aware. That's what I had in mind. Naively it looks like we just need to replace char type with Py_UNICODE in several places. Assuming exotic digit conversion is still handled separately. > (One of the return values from the dtoa.c parser is a pointer to the significant digits > in the original input string; so the base-conversion calculation itself needs access > to portions of the original string.) > Maybe we should start with int(). It is simpler, but probably reveal some of the same difficulties as float() > Ideally, for float(string), we'd have a zero-copy setup that operated directly on the > unicode input (read-only); but I think that achieving that right now is going to be > messy, and involve dtoa.c knowing far more about Unicode that I'd be comfortable > with. > This is clearly a 3.3-ish project. Hopefully in time people will realize that decimal digits are just [0-9] and numeric experts will not be required to know about Unicode beyond 127th code point. :-) > N.B. If we didn't have to deal with alternative digits, it *really* would be much simpler. > We still don't. I've already separated this out and we can keep it this way as long as people are willing to pay the price for alternative digits' support. One thing we may improve, is to fail earlier on non-digits in PyUnicode_TransformDecimalToASCII() to speedup not uncommon code like this: for line in f: try: n = int(lint) except ValueError: pass ... > Perhaps a compromise option is available, that does a preliminary pass on the > Unicode string and only makes a copy if non-European digits are discovered. Hmm. That would require changing the signature of PyUnicode_TransformDecimalToASCII() to take PyObject* instead of the buffer. I knew we shouldn't have rushed to make it public. We can still do it in longobject.c and friends' boilerplate. |
|||
msg179047 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * ![]() |
Date: 2013-01-04 17:12 | |
Can this issue be closed? |
|||
msg181455 - (view) | Author: Mark Dickinson (mark.dickinson) * ![]() |
Date: 2013-02-05 16:26 | |
Sure, this can be closed as far as I'm concerned. |
|||
msg181824 - (view) | Author: Mark Dickinson (mark.dickinson) * ![]() |
Date: 2013-02-10 18:12 | |
Closing. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:57:09 | admin | set | github: 54766 |
2013-02-10 18:12:02 | mark.dickinson | set | status: pending -> closed messages: + msg181824 |
2013-02-05 16:26:22 | mark.dickinson | set | status: open -> pending |
2013-02-05 16:26:14 | mark.dickinson | set | status: pending -> open messages: + msg181455 |
2013-02-05 14:38:55 | serhiy.storchaka | set | status: open -> pending |
2013-01-04 17:12:12 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka messages: + msg179047 |
2011-11-21 20:17:02 | vstinner | set | versions: + Python 3.3, - Python 3.1 |
2010-12-20 18:52:10 | belopolsky | set | nosy:
lemburg, mark.dickinson, belopolsky, eric.smith, ezio.melotti, skrah versions: + Python 3.1 |
2010-12-04 21:03:18 | belopolsky | set | messages: + msg123403 |
2010-12-04 20:11:53 | mark.dickinson | set | messages: + msg123398 |
2010-12-04 19:31:47 | belopolsky | set | messages: + msg123388 |
2010-12-04 11:03:16 | mark.dickinson | set | messages: + msg123335 |
2010-12-04 03:43:18 | belopolsky | set | resolution: fixed messages: + msg123316 |
2010-12-04 03:04:57 | belopolsky | set | files:
+ issue10557d.diff messages: + msg123315 |
2010-12-03 17:54:26 | belopolsky | set | messages: + msg123275 |
2010-12-03 16:41:42 | belopolsky | set | messages: + msg123260 |
2010-12-03 12:47:46 | mark.dickinson | set | messages: + msg123240 |
2010-12-03 12:28:31 | eric.smith | set | messages: + msg123230 |
2010-12-03 10:02:23 | mark.dickinson | set | messages: + msg123222 |
2010-12-03 09:57:35 | vstinner | set | nosy:
- vstinner |
2010-12-03 09:45:02 | lemburg | set | messages: + msg123218 |
2010-12-03 02:53:36 | belopolsky | set | files:
+ issue10557c.diff messages: + msg123166 |
2010-12-03 00:07:32 | belopolsky | set | messages: + msg123147 |
2010-12-02 23:32:01 | skrah | set | messages: + msg123144 |
2010-12-02 22:08:48 | belopolsky | set | messages: + msg123133 |
2010-12-02 21:34:42 | lemburg | set | messages: + msg123123 |
2010-12-02 19:20:05 | mark.dickinson | set | messages: + msg123103 |
2010-12-02 18:28:56 | belopolsky | set | messages: + msg123095 |
2010-12-02 18:28:15 | belopolsky | set | messages: + msg123094 |
2010-12-02 18:08:28 | mark.dickinson | set | messages: + msg123091 |
2010-12-02 17:54:27 | mark.dickinson | set | messages: + msg123088 |
2010-12-02 17:38:56 | belopolsky | set | files: + issue10557b.diff |
2010-12-02 17:38:24 | belopolsky | set | assignee: belopolsky messages: + msg123087 stage: test needed -> commit review |
2010-11-29 15:39:49 | belopolsky | set | files:
+ issue10557a.diff messages: + msg122816 |
2010-11-29 09:41:33 | lemburg | set | messages: + msg122785 |
2010-11-29 06:23:07 | belopolsky | set | files:
+ issue10557.diff messages: + msg122775 |
2010-11-28 20:38:25 | belopolsky | set | messages: - msg122725 |
2010-11-28 20:38:11 | belopolsky | set | messages: + msg122726 |
2010-11-28 20:36:59 | belopolsky | set | messages: + msg122725 |
2010-11-28 20:30:39 | skrah | set | messages: + msg122723 |
2010-11-28 20:22:01 | belopolsky | set | messages: + msg122719 |
2010-11-28 20:09:12 | skrah | set | messages: + msg122716 |
2010-11-28 18:37:40 | belopolsky | set | messages: + msg122698 |
2010-11-28 18:09:20 | lemburg | set | messages: + msg122693 |
2010-11-28 18:01:45 | lemburg | set | messages: + msg122691 |
2010-11-28 17:48:45 | skrah | set | nosy:
+ skrah messages: + msg122689 |
2010-11-28 17:34:03 | belopolsky | set | messages: + msg122688 |
2010-11-28 17:33:45 | mark.dickinson | set | messages: + msg122687 |
2010-11-28 17:24:19 | belopolsky | set | messages: + msg122684 |
2010-11-28 17:00:08 | mark.dickinson | set | nosy:
+ lemburg messages: + msg122683 |
2010-11-28 16:36:58 | mark.dickinson | set | messages: + msg122679 |
2010-11-28 10:49:11 | eric.smith | set | nosy:
+ eric.smith |
2010-11-28 05:37:13 | ezio.melotti | set | messages: + msg122634 |
2010-11-28 04:29:32 | ezio.melotti | set | messages: + msg122625 |
2010-11-28 03:22:23 | belopolsky | create |