Message 177915 - Python tracker

Author	mrabarnett
Recipients	gangesmaster, mark.dickinson, mrabarnett
Date	2012-12-22.01:30:13
Message-id	<1356139816.18.0.452391860006.issue16741@psf.upfronthosting.co.za>
In-reply-to

Content
Python takes the long way round when converting a string to an int. It does the following (I'm describing Python 3.3 here):

1. In function 'fix_decimal_and_space_to_ascii', the different kinds of spaces are converted to " " and the different kinds of digits are converted to their equivalents in the ASCII range;

2. The resulting string is converted to UTF-8;

3. The resulting string is passed to 'PyLong_FromString', which expects a null-terminated string;

4. If 'PyLong_FromString' is unable to parse the string as an int, it builds an error message using the string that was passed into it, which it does by converting that string _back_ into Unicode.
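The four steps can be emulated in pure Python to see which string ends up in the error message. This is only a rough sketch, not CPython's actual code: the function name 'emulate_reported_value' is made up for illustration, and 'str.isspace' and 'unicodedata.decimal' are assumed to be close enough stand-ins for what 'fix_decimal_and_space_to_ascii' does.

```python
import unicodedata


def emulate_reported_value(text: str) -> str:
    """Hypothetical emulation of steps 1-4; not CPython's implementation."""
    # Step 1: map every kind of space to " " and every decimal digit
    # to its ASCII equivalent; leave all other characters alone.
    normalized = []
    for ch in text:
        if ch.isspace():
            normalized.append(" ")
            continue
        digit = unicodedata.decimal(ch, None)
        normalized.append(str(digit) if digit is not None else ch)

    # Step 2: convert the result to UTF-8.
    encoded = "".join(normalized).encode("utf-8")

    # Step 3: a consumer of a null-terminated string stops at the first NUL.
    c_view = encoded.split(b"\x00", 1)[0]

    # Step 4: the error message is built by decoding that C string back.
    return c_view.decode("utf-8")


print(repr(emulate_reported_value("#\N{ARABIC-INDIC DIGIT ONE}")))
print(repr(emulate_reported_value("foo\x00bar")))
```

Under those assumptions, the value that comes out of step 4 is the normalized and possibly truncated string, not the one the caller passed in.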

As a result of step 4, the string that's reported as the value in the error message is _not_ necessarily correct.

For example:

>>> int("\N{ARABIC-INDIC DIGIT ONE}")
1
>>> int("#\N{ARABIC-INDIC DIGIT ONE}")
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    int("#\N{ARABIC-INDIC DIGIT ONE}")
ValueError: invalid literal for int() with base 10: '#1'

Because the string is treated as null-terminated, it also means that a "\x00" and anything after it will be omitted from the reported value:

>>> int("foo\x00bar")
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    int("foo\x00bar")
ValueError: invalid literal for int() with base 10: 'foo'
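The truncation itself is easy to reproduce without int() at all. The sketch below uses 'ctypes' purely as a stand-in for any C code that reads a null-terminated string; it does not call 'PyLong_FromString', and the variable names are only illustrative.

```python
import ctypes

data = "foo\x00bar".encode("utf-8")

# create_string_buffer copies the bytes and appends a terminating NUL.
buf = ctypes.create_string_buffer(data)

full = buf.raw      # every byte in the buffer, embedded NUL included
c_view = buf.value  # what a null-terminated reader sees: stops at the first NUL

print(full)
print(c_view)
```

Everything after the embedded NUL is still in memory, but code that relies on the terminator never looks at it.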

As a final point, 'PyLong_FromString' limits the length of the value it reports in the error message, and the code that does so includes this line:

    slen = strlen(orig_str) < 200 ? strlen(orig_str) : 200;
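For readers less used to C's conditional operator, here is a Python rendering of that line. It is only an illustration: 'reported_length' and 'MAX_REPORTED' are hypothetical names, and the sketch measures the length once and reuses it, whereas the C expression as written calls 'strlen' twice.

```python
MAX_REPORTED = 200  # the limit that appears in the quoted C line


def reported_length(orig: bytes) -> int:
    """Clamp the length of the reported value to MAX_REPORTED bytes."""
    length = len(orig)  # measured once, then reused
    return length if length < MAX_REPORTED else MAX_REPORTED


print(reported_length(b"x" * 10))
print(reported_length(b"x" * 500))
```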

History
Date	User	Action	Args
2012-12-22 01:30:16	mrabarnett	set	recipients: + mrabarnett, mark.dickinson, gangesmaster
2012-12-22 01:30:16	mrabarnett	set	messageid: <1356139816.18.0.452391860006.issue16741@psf.upfronthosting.co.za>
2012-12-22 01:30:15	mrabarnett	link	issue16741 messages
2012-12-22 01:30:13	mrabarnett	create