msg177863 - (view) |
Author: ganges master (gangesmaster) |
Date: 2012-12-20 23:37 |
I'm not sure if it's a bug or just an inconvenience, but when a string containing \x00 is passed to int/float/etc, they return a misleading exception:
>>> int("abc")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: 'abc'
>>> int("\x00abc")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: ''
>>> float("\x00abc")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: could not convert string to float:
I noticed the code does actually try to handle it:
http://hg.python.org/cpython/file/39803c20c9bf/Objects/intobject.c#l1066
but still, the reported error is very misleading.
|
msg177915 - (view) |
Author: Matthew Barnett (mrabarnett) * |
Date: 2012-12-22 01:30 |
Python takes a long way round when converting strings to int. It does the following (I'll be talking about Python 3.3 here):
1. In function 'fix_decimal_and_space_to_ascii', the different kinds of spaces are converted to " " and the different kinds of digits are converted to their equivalents in the ASCII range;
2. The resulting string is converted to UTF-8;
3. The resulting string is passed to 'PyLong_FromString', which expects a null-terminated string.
4. If 'PyLong_FromString' is unable to parse the string as an int, it builds an error message using the string that was passed into it, which it does by converting that string _back_ into Unicode.
As a result of step 4, the string that's reported as the value in the error message is _not_ necessarily correct.
For example:
>>> int("\N{ARABIC-INDIC DIGIT ONE}")
1
>>> int("#\N{ARABIC-INDIC DIGIT ONE}")
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
int("#\N{ARABIC-INDIC DIGIT ONE}")
ValueError: invalid literal for int() with base 10: '#1'
And it also means a "\x00" and anything after it will be omitted:
>>> int("foo\x00bar")
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
int("foo\x00bar")
ValueError: invalid literal for int() with base 10: 'foo'
And in a final point, 'PyLong_FromString' limits the length of the value it reports in the error message, and the code that does it includes this line:
slen = strlen(orig_str) < 200 ? strlen(orig_str) : 200;
|
msg178003 - (view) |
Author: Matthew Barnett (mrabarnett) * |
Date: 2012-12-23 18:12 |
It occurred to me that the truncation of the string when building the error message could cause a UnicodeDecodeError:
>>> int("1".ljust(199) + "\u0100")
Traceback (most recent call last):
File "<pyshell#0>", line 1, in <module>
int("1".ljust(199) + "\u0100")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 199: unexpected end of data
This is because it's truncating a UTF-8 string, and the truncation is in the middle of a multi-byte sequence.
|
msg178548 - (view) |
Author: Matthew Barnett (mrabarnett) * |
Date: 2012-12-30 02:52 |
I've attached a patch.
It now reports an invalid literal as-is:
>>> int("#\N{ARABIC-INDIC DIGIT ONE}")
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
int("#\N{ARABIC-INDIC DIGIT ONE}")
ValueError: invalid literal for int() with base 10: '#١'
>>> int("foo\x00bar")
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
int("foo\x00bar")
ValueError: invalid literal for int() with base 10: 'foo\x00bar'
There's a slight difference in that it truncates to 200 codepoints, not 200 UTF-8 bytes.
|
msg178587 - (view) |
Author: Matthew Barnett (mrabarnett) * |
Date: 2012-12-30 17:12 |
I've attached a small additional patch for truncating the UTF-8.
I don't know whether it's strictly necessary, but I don't know that it's unnecessary either! (Better safe than sorry.)
|
msg188458 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-05-05 19:40 |
Here is a patch based on Matthew's patch. It is smaller (+35 lines vs +59) but fixes error messages for more cases:
int(b'123\0') -- bytes string with null without base.
int(b'123\xbd') -- non-utf-8 bytes string.
int('123\ud800') -- lone surrogate in unicode string.
Unfortunately it is not easy to backport it to 2.7. PyErr_Format() in 2.7 works only with null-terminated strings. I propose to fix this issue on 3.3+ and declare it as "won't fix" for 2.7.
|
msg188463 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2013-05-05 20:17 |
int_from_str.patch:
+ strobj = PySequence_GetSlice(u, 0, 200);
+ if (strobj != NULL) {
+ PyErr_Format(PyExc_ValueError,
+ "invalid literal for int() with base %d: %R",
+ base, strobj);
+ Py_DECREF(strobj);
+ }
Oh, it remembers me that #7330 is still open.
|
msg188777 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-05-09 13:31 |
Thanks, for 3.4 I will use new formatting feature.
|
msg190212 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-05-28 13:50 |
Are there any other comments?
|
msg190866 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-06-09 17:23 |
Patch updated. It now reuses code for bytes->int in longobject.c and abstract.c, doesn't raise UnicodeDecodeError for non-utf-8 bytes, and always reports an invalid bytes literal as a bytes object.
|
msg193047 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-07-14 12:37 |
Here is a patch for 2.7.
|
msg193556 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-07-22 18:13 |
If there are no objections I'm going to commit patches soon.
|
msg193558 - (view) |
Author: Christian Heimes (christian.heimes) * |
Date: 2013-07-22 18:45 |
I don't like the idea to change the behavior of 2.7 so late in its release cycle. Benjamin, what's your opinion?
|
msg193565 - (view) |
Author: Benjamin Peterson (benjamin.peterson) * |
Date: 2013-07-22 20:42 |
Yeah, let's just fix Python 3.
|
msg194282 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2013-08-03 18:20 |
New changeset ecc8512b427d by Serhiy Storchaka in branch '3.3':
Issue #16741: Fix an error reporting in int().
http://hg.python.org/cpython/rev/ecc8512b427d
New changeset 4fd48a807812 by Serhiy Storchaka in branch 'default':
Issue #16741: Fix an error reporting in int().
http://hg.python.org/cpython/rev/4fd48a807812
|
msg194302 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-08-03 21:16 |
There is a test in test_unicode which expects an UnicodeError for int('\ud800'). Now it fails. Should we fix a test or int()?
|
msg194304 - (view) |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2013-08-03 21:45 |
I'd say fix the test. Raising ValueError is correct in this case. UnicodeError was an implementation artifact.
|
msg194310 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2013-08-03 22:06 |
New changeset 7b023134ad83 by Serhiy Storchaka in branch '3.3':
Issue #16741: Remove testing of implementation artifact.
http://hg.python.org/cpython/rev/7b023134ad83
New changeset 1b4772ab420f by Serhiy Storchaka in branch 'default':
Issue #16741: Remove testing of implementation artifact.
http://hg.python.org/cpython/rev/1b4772ab420f
|
|
Date |
User |
Action |
Args |
2022-04-11 14:57:39 | admin | set | github: 60945 |
2014-05-23 20:39:13 | terry.reedy | link | issue21546 superseder |
2013-08-06 14:48:58 | serhiy.storchaka | set | status: open -> closed resolution: fixed |
2013-08-03 22:06:07 | python-dev | set | messages:
+ msg194310 |
2013-08-03 21:45:20 | belopolsky | set | nosy:
+ belopolsky messages:
+ msg194304
|
2013-08-03 21:16:13 | serhiy.storchaka | set | status: closed -> open resolution: fixed -> (no value) messages:
+ msg194302
|
2013-08-03 18:33:00 | serhiy.storchaka | set | status: open -> closed resolution: fixed stage: patch review -> resolved |
2013-08-03 18:20:47 | python-dev | set | nosy:
+ python-dev messages:
+ msg194282
|
2013-07-22 20:42:37 | benjamin.peterson | set | messages:
+ msg193565 |
2013-07-22 18:45:13 | christian.heimes | set | nosy:
+ christian.heimes, benjamin.peterson messages:
+ msg193558
|
2013-07-22 18:13:18 | serhiy.storchaka | set | messages:
+ msg193556 |
2013-07-14 12:37:58 | serhiy.storchaka | set | files:
+ int_from_str-2.7.patch
messages:
+ msg193047 |
2013-06-09 17:23:10 | serhiy.storchaka | set | files:
+ int_from_str-3.3_2.patch
messages:
+ msg190866 |
2013-05-28 13:50:15 | serhiy.storchaka | set | messages:
+ msg190212 |
2013-05-09 13:31:11 | serhiy.storchaka | set | messages:
+ msg188777 |
2013-05-05 20:17:02 | vstinner | set | messages:
+ msg188463 |
2013-05-05 19:41:35 | serhiy.storchaka | set | assignee: serhiy.storchaka |
2013-05-05 19:40:45 | serhiy.storchaka | set | files:
+ int_from_str.patch versions:
- Python 3.2 nosy:
+ chris.jerdonek
messages:
+ msg188458
|
2013-04-19 21:36:16 | isoschiz | set | nosy:
+ isoschiz
|
2012-12-30 17:18:22 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka
versions:
+ Python 3.4 |
2012-12-30 17:12:39 | mrabarnett | set | files:
+ issue16741#2.patch
messages:
+ msg178587 |
2012-12-30 05:32:02 | ezio.melotti | set | nosy:
+ vstinner, ezio.melotti
stage: patch review |
2012-12-30 02:52:19 | mrabarnett | set | files:
+ issue16741.patch keywords:
+ patch messages:
+ msg178548
|
2012-12-23 18:12:33 | mrabarnett | set | messages:
+ msg178003 |
2012-12-22 01:30:16 | mrabarnett | set | nosy:
+ mrabarnett messages:
+ msg177915
|
2012-12-22 01:17:17 | terry.reedy | set | versions:
- Python 2.6, Python 3.1 |
2012-12-21 02:03:46 | benjamin.peterson | set | nosy:
+ mark.dickinson
|
2012-12-20 23:37:21 | gangesmaster | create | |