Message 123087 - Python tracker



Author	belopolsky
Recipients	belopolsky, eric.smith, ezio.melotti, lemburg, mark.dickinson, skrah, vstinner
Date	2010-12-02.17:38:24
Message-id	<1291311511.19.0.80649786553.issue10557@psf.upfronthosting.co.za>
In-reply-to

Content

I am submitting a patch (issue10557b.diff) for commit review.  As Marc suggested, decimal conversion is now performed on Py_UNICODE characters.  For this purpose, I introduced a _PyUnicode_NormalizeDecimal() function that takes Py_UNICODE characters and returns a PyUnicode object with whitespace stripped and non-ASCII digits converted to their ASCII equivalents.  The PyUnicode_EncodeDecimal() function is no longer used, and I added a comment recommending that _PyUnicode_NormalizeDecimal() be used instead.  I would like to eventually remove PyUnicode_EncodeDecimal(), but I am not sure about the proper deprecation procedure for undocumented C APIs.
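
To illustrate the normalization described above, here is a rough Python sketch of the idea. It is not the C code from the patch: the name normalize_decimal is hypothetical, and the sketch assumes that "digit" means any character with a Unicode decimal value, as reported by unicodedata.decimal().

```python
import unicodedata


def normalize_decimal(s):
    """Strip surrounding whitespace and map Unicode decimal digits to ASCII.

    Hypothetical pure-Python analogue of the normalization step; characters
    that are not decimal digits are passed through unchanged so that the
    parser can report them.
    """
    out = []
    for ch in s.strip():
        # unicodedata.decimal() returns the digit value, or the default
        # (None here) when the character is not a decimal digit.
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)


# Arabic-Indic digits one, two, three surrounded by spaces.
print(normalize_decimal(" \u0661\u0662\u0663 "))
```

The result contains only ASCII digits plus whatever non-digit characters were in the input, so it can be handed to an ASCII-only number parser.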

As a result, int(), float(), etc. will no longer raise UnicodeDecodeError unless given a string with lone surrogates.  (This error comes from the UTF-8 codec that is applied after digit normalization.)

A few error cases, such as embedded '\0' and non-digit characters with ord(c) > 255, will now raise ValueError instead of UnicodeDecodeError.  Since UnicodeDecodeError is a subclass of ValueError, it is unlikely that existing code would attempt to differentiate between the two.  It is possible to achieve complete compatibility, but it is hard to justify reporting different error types on non-digit characters below and above code point 255.
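
The compatibility argument can be sketched in Python as follows. The helper parse_or_none is a hypothetical example of typical calling code, not something from the patch; the point is only that a handler written for ValueError also covers UnicodeDecodeError.

```python
# UnicodeDecodeError derives from ValueError (via UnicodeError), so an
# "except ValueError" clause catches either exception type.
assert issubclass(UnicodeDecodeError, ValueError)


def parse_or_none(s):
    """Return int(s), or None if s is not a valid integer literal."""
    try:
        return int(s)
    except ValueError:
        # Also reached for any ValueError subclass, including
        # UnicodeDecodeError.
        return None


print(parse_or_none(" 42 "))
print(parse_or_none("12\u20ac"))  # non-digit character with ord(c) > 255
print(parse_or_none("1\x002"))    # embedded '\0'
```

Code written this way behaves the same before and after the change; only code that catches UnicodeDecodeError specifically would notice a difference.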

The patch contains tests for error messages, which I tried to make robust by requiring only that s.strip() be found somewhere in the error message from int(s).  Note that since this patch strips whitespace before the string is passed to the parser, the parser errors do not contain the whitespace.  This may actually be desirable because it helps the user see the source of the error without being distracted by irrelevant whitespace.
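
The kind of check described above might look roughly like this in Python. It is a sketch, not the test code from the patch; the helper name error_mentions_input is hypothetical.

```python
def error_mentions_input(s):
    """Return True if int(s) fails and the message contains s.strip().

    Only the stripped input is required to appear, so the check does not
    depend on whether the message keeps the surrounding whitespace.
    """
    try:
        int(s)
    except ValueError as exc:
        return s.strip() in str(exc)
    # int(s) succeeded, so there is no error message to inspect.
    return False


print(error_mentions_input("  12x  "))
```

Requiring only the stripped text keeps the test valid whether or not the parser sees the original whitespace. Inputs whose repr() escapes characters (for example, an embedded '\0') would need a different comparison.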

History
Date	User	Action	Args
2010-12-02 17:38:31	belopolsky	set	recipients: + belopolsky, lemburg, mark.dickinson, vstinner, eric.smith, ezio.melotti, skrah
2010-12-02 17:38:31	belopolsky	set	messageid: <1291311511.19.0.80649786553.issue10557@psf.upfronthosting.co.za>
2010-12-02 17:38:24	belopolsky	link	issue10557 messages
2010-12-02 17:38:24	belopolsky	create