classification
Title: `int()`, `float()`, etc think python strings are null-terminated
Type: behavior Stage: resolved
Components: Interpreter Core Versions: Python 3.4, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: belopolsky, benjamin.peterson, chris.jerdonek, christian.heimes, ezio.melotti, gangesmaster, haypo, isoschiz, mark.dickinson, mrabarnett, python-dev, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2012-12-20 23:37 by gangesmaster, last changed 2013-08-06 14:48 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
issue16741.patch mrabarnett, 2012-12-30 02:52 review
issue16741#2.patch mrabarnett, 2012-12-30 17:12
int_from_str.patch serhiy.storchaka, 2013-05-05 19:40 review
int_from_str-3.3_2.patch serhiy.storchaka, 2013-06-09 17:23 Patch for 3.3 review
int_from_str-2.7.patch serhiy.storchaka, 2013-07-14 12:37 Patch for 2.7 review
Messages (18)
msg177863 - (view) Author: ganges master (gangesmaster) Date: 2012-12-20 23:37
I'm not sure if it's a bug or just an inconvenience, but when a string containing \x00 is passed to int/float/etc, they return a misleading exception:

    >>> int("abc")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: invalid literal for int() with base 10: 'abc'
    >>> int("\x00abc")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: invalid literal for int() with base 10: ''
    >>> float("\x00abc")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: could not convert string to float:

I noticed the code does actually try to handle it:
http://hg.python.org/cpython/file/39803c20c9bf/Objects/intobject.c#l1066

but still, the reported error is very misleading.
msg177915 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2012-12-22 01:30
Python takes a long way round when converting strings to int. It does the following (I'll be talking about Python 3.3 here):

1. In function 'fix_decimal_and_space_to_ascii', the different kinds of spaces are converted to " " and the different kinds of digits are converted to their equivalents in the ASCII range;

2. The resulting string is converted to UTF-8;

3. The resulting string is passed to 'PyLong_FromString', which expects a null-terminated string.

4. If 'PyLong_FromString' is unable to parse the string as an int, it builds an error message using the string that was passed into it, which it does by converting that string _back_ into Unicode.

As a result of step 4, the string that's reported as the value in the error message is _not_ necessarily correct.

For example:

>>> int("\N{ARABIC-INDIC DIGIT ONE}")
1
>>> int("#\N{ARABIC-INDIC DIGIT ONE}")
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    int("#\N{ARABIC-INDIC DIGIT ONE}")
ValueError: invalid literal for int() with base 10: '#1'

And it also means a "\x00" and anything after it will be omitted:

>>> int("foo\x00bar")
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    int("foo\x00bar")
ValueError: invalid literal for int() with base 10: 'foo'

And in a final point, 'PyLong_FromString' limits the length of the value it reports in the error message, and the code that does it includes this line:

    slen = strlen(orig_str) < 200 ? strlen(orig_str) : 200;
msg178003 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2012-12-23 18:12
It occurred to me that the truncation of the string when building the error message could cause a UnicodeDecodeError:

>>> int("1".ljust(199) + "\u0100")
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    int("1".ljust(199) + "\u0100")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 199: unexpected end of data

This is because it's truncating a UTF-8 string, and the truncation is in the middle of a multi-byte sequence.
msg178548 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2012-12-30 02:52
I've attached a patch.

It now reports an invalid literal as-is:

>>> int("#\N{ARABIC-INDIC DIGIT ONE}")
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    int("#\N{ARABIC-INDIC DIGIT ONE}")
ValueError: invalid literal for int() with base 10: '#١'
>>> int("foo\x00bar")
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    int("foo\x00bar")
ValueError: invalid literal for int() with base 10: 'foo\x00bar'

There's a slight difference in that it truncates to 200 codepoints, not 200 UTF-8 bytes.
msg178587 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2012-12-30 17:12
I've attached a small additional patch for truncating the UTF-8.

I don't know whether it's strictly necessary, but I don't know that it's unnecessary either! (Better safe than sorry.)
msg188458 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-05 19:40
Here is a patch based on Matthew's patch. It is smaller (+35 lines vs +59) but fixes error messages for more cases:

int(b'123\0') -- bytes string with null without base.
int(b'123\xbd') -- non-utf-8 bytes string.
int('123\ud800') -- lone surrogate in unicode string.

Unfortunately it is not easy to backport it to 2.7. PyErr_Format() in 2.7 works only with null-terminated strings. I propose to fix this issue on 3.3+ and declare it as "won't fix" for 2.7.
msg188463 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-05-05 20:17
int_from_str.patch:

+    strobj = PySequence_GetSlice(u, 0, 200);
+    if (strobj != NULL) {
+        PyErr_Format(PyExc_ValueError,
+                     "invalid literal for int() with base %d: %R",
+                     base, strobj);
+        Py_DECREF(strobj);
+    }

Oh, it remembers me that #7330 is still open.
msg188777 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-09 13:31
Thanks, for 3.4 I will use new formatting feature.
msg190212 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-28 13:50
Are there any other comments?
msg190866 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-06-09 17:23
Patch updated. It now reuses code for bytes->int in longobject.c and abstract.c, doesn't raise UnicodeDecodeError for non-utf-8 bytes, and always reports an invalid bytes literal as a bytes object.
msg193047 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-07-14 12:37
Here is a patch for 2.7.
msg193556 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-07-22 18:13
If there are no objections I'm going to commit patches soon.
msg193558 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-07-22 18:45
I don't like the idea to change the behavior of 2.7 so late in its release cycle. Benjamin, what's your opinion?
msg193565 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2013-07-22 20:42
Yeah, let's just fix Python 3.
msg194282 - (view) Author: Roundup Robot (python-dev) Date: 2013-08-03 18:20
New changeset ecc8512b427d by Serhiy Storchaka in branch '3.3':
Issue #16741: Fix an error reporting in int().
http://hg.python.org/cpython/rev/ecc8512b427d

New changeset 4fd48a807812 by Serhiy Storchaka in branch 'default':
Issue #16741: Fix an error reporting in int().
http://hg.python.org/cpython/rev/4fd48a807812
msg194302 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-08-03 21:16
There is a test in test_unicode which expects an UnicodeError for int('\ud800'). Now it fails. Should we fix a test or int()?
msg194304 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-08-03 21:45
I'd say fix the test.  Raising ValueError is correct in this case.  UnicodeError was an implementation artifact.
msg194310 - (view) Author: Roundup Robot (python-dev) Date: 2013-08-03 22:06
New changeset 7b023134ad83 by Serhiy Storchaka in branch '3.3':
Issue #16741: Remove testing of implementation artifact.
http://hg.python.org/cpython/rev/7b023134ad83

New changeset 1b4772ab420f by Serhiy Storchaka in branch 'default':
Issue #16741: Remove testing of implementation artifact.
http://hg.python.org/cpython/rev/1b4772ab420f
History
Date User Action Args
2013-08-06 14:48:58serhiy.storchakasetstatus: open -> closed
resolution: fixed
2013-08-03 22:06:07python-devsetmessages: + msg194310
2013-08-03 21:45:20belopolskysetnosy: + belopolsky
messages: + msg194304
2013-08-03 21:16:13serhiy.storchakasetstatus: closed -> open
resolution: fixed -> (no value)
messages: + msg194302
2013-08-03 18:33:00serhiy.storchakasetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2013-08-03 18:20:47python-devsetnosy: + python-dev
messages: + msg194282
2013-07-22 20:42:37benjamin.petersonsetmessages: + msg193565
2013-07-22 18:45:13christian.heimessetnosy: + christian.heimes, benjamin.peterson
messages: + msg193558
2013-07-22 18:13:18serhiy.storchakasetmessages: + msg193556
2013-07-14 12:37:58serhiy.storchakasetfiles: + int_from_str-2.7.patch

messages: + msg193047
2013-06-09 17:23:10serhiy.storchakasetfiles: + int_from_str-3.3_2.patch

messages: + msg190866
2013-05-28 13:50:15serhiy.storchakasetmessages: + msg190212
2013-05-09 13:31:11serhiy.storchakasetmessages: + msg188777
2013-05-05 20:17:02hayposetmessages: + msg188463
2013-05-05 19:41:35serhiy.storchakasetassignee: serhiy.storchaka
2013-05-05 19:40:45serhiy.storchakasetfiles: + int_from_str.patch
versions: - Python 3.2
nosy: + chris.jerdonek

messages: + msg188458
2013-04-19 21:36:16isoschizsetnosy: + isoschiz
2012-12-30 17:18:22serhiy.storchakasetnosy: + serhiy.storchaka

versions: + Python 3.4
2012-12-30 17:12:39mrabarnettsetfiles: + issue16741#2.patch

messages: + msg178587
2012-12-30 05:32:02ezio.melottisetnosy: + haypo, ezio.melotti

stage: patch review
2012-12-30 02:52:19mrabarnettsetfiles: + issue16741.patch
keywords: + patch
messages: + msg178548
2012-12-23 18:12:33mrabarnettsetmessages: + msg178003
2012-12-22 01:30:16mrabarnettsetnosy: + mrabarnett
messages: + msg177915
2012-12-22 01:17:17terry.reedysetversions: - Python 2.6, Python 3.1
2012-12-21 02:03:46benjamin.petersonsetnosy: + mark.dickinson
2012-12-20 23:37:21gangesmastercreate