Fix for int(string, base) wrong answers (take 2) #42513

alanmcintyre · 2005-10-24T05:33:10Z

BPO	1335972
Nosy	@tim-one, @birkenfeld
Files	python-mystrtoul3.tgz: Fix for int(string, base) errors and added tests python-mystrtoul4.tgz: Corrected patch to fix problem with literals python-mystrtoul5.tgz: Cleaned up digit lookup vector & added tests for base 10 and base 2**x

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2006-05-23.18:49:07.000>
created_at = <Date 2005-10-24.05:33:10.000>
labels = ['interpreter-core']
title = 'Fix for int(string, base) wrong answers (take 2)'
updated_at = <Date 2006-05-23.18:49:07.000>
user = 'https://bugs.python.org/alanmcintyre'

bugs.python.org fields:

activity = <Date 2006-05-23.18:49:07.000>
actor = 'tim.peters'
assignee = 'none'
closed = True
closed_date = None
closer = None
components = ['Interpreter Core']
creation = <Date 2005-10-24.05:33:10.000>
creator = 'alanmcintyre'
dependencies = []
files = ['6833', '6834', '6835']
hgrepos = []
issue_num = 1335972
keywords = ['patch']
message_count = 8.0
messages = ['48895', '48896', '48897', '48898', '48899', '48900', '48901', '48902']
nosy_count = 4.0
nosy_names = ['tim.peters', 'georg.brandl', 'alanmcintyre', 'funny_falcon']
pr_nums = []
priority = 'normal'
resolution = 'accepted'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue1335972'
versions = []

alanmcintyre · 2005-10-24T05:33:10Z

This incorporates patch bpo-1334979, adds test cases for
all the cases listed in bug bpo-1334662, and adds test
cases for evaluation of 2**32+1. There seem to be some
minor speed improvements (simplistic stats shown
below). Some simple performance test scripts have been
included in the attached file as well.

A lookup table was added for the maximum number of
digits that can never overflow on a 32-bit ulong for
each base. Overflow is only checked when this limit is
exceeded by 1, and once the input string is determined
to be too long (2 over the limit), the evaluation is
halted and an overflow indication is returned. This
appears to help reduce the evaluation time for very
long strings (no time is wasted trying to evaluate all
of it into a 32-bit ulong).

Evaluation of each character has also been replaced by
a lookup table. I'm not certain of the amount of speed
benefit obtained from this; I added it early on and
haven't had time to go back and test. It may be that
it's not worth the extra static table.

Baseline Python from CVS:
alan@tarantula:/python/dist/src# ./python -m timeit
'int("9")'
100000 loops, best of 3: 4 usec per loop
alan@tarantula:/python/dist/src# ./python -m timeit
'int("999999999")'
100000 loops, best of 3: 5.49 usec per loop
alan@tarantula:/python/dist/src# ./python -m timeit
'int("999999999999")'
100000 loops, best of 3: 11.8 usec per loop
alan@tarantula:/python/dist/src# ./python -m timeit
'int("999999999999999")'
100000 loops, best of 3: 13.4 usec per loop
alan@tarantula:~/python/dist/src# ./python -m timeit
'int("1"*600)'
1000 loops, best of 3: 997 usec per loop

Modified:
alan@tarantula:/python_testint/dist/src# ./python -m
timeit 'int("9")'
100000 loops, best of 3: 3.63 usec per loop
alan@tarantula:/python_testint/dist/src# ./python -m
timeit 'int("999999999")'
100000 loops, best of 3: 3.93 usec per loop
alan@tarantula:/python_testint/dist/src# ./python -m
timeit 'int("999999999999")'
100000 loops, best of 3: 9.79 usec per loop
alan@tarantula:/python_testint/dist/src# ./python -m
timeit 'int("999999999999999")'
100000 loops, best of 3: 11 usec per loop
alan@tarantula:~/python_testint/dist/src# ./python -m
timeit 'int("1"*600)'
1000 loops, best of 3: 905 usec per loop

10.2% faster for 1-digit int
39.7% faster for 9-digit int
20.5% faster for 12-digit int
21.8% faster for 15-digit int
10.2% faster for 600-digit int

Test program that takes 750k ints from [0, 2**32)
through stdin:
Baseline: 8.114 sec (best of 5 consecutive runs)
Modified: 6.774 sec (best of 5 consecutive runs)

19.8% faster

NOTE: This patch causes new errors in test_array and
test_compile, but it seems that these *should* be
failing given the input string for long(), unless I'm
missing something:

======================================================================
ERROR: test_repr (main.FloatTest)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "Lib/test/test_array.py", line 187, in test_repr
    self.assertEqual(a, eval(repr(a), {"array":
array.array}))
ValueError: invalid literal for long(): 10000000000.0

======================================================================
ERROR: test_repr (main.DoubleTest)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "Lib/test/test_array.py", line 187, in test_repr
    self.assertEqual(a, eval(repr(a), {"array":
array.array}))
ValueError: invalid literal for long(): 10000000000.0

test test_compile crashed -- exceptions.ValueError:
invalid literal for long():
90000000000000.

alanmcintyre · 2005-10-24T05:41:45Z

Logged In: YES
user_id=1115903

I forgot to add that these results were obtained on a PIIIm
833MHz running Linux 2.4.2, GCC 3.2.2, with the Python 2.5a0
CVS source from about 8pm EST Oct 23, 2005.

funnyfalcon · 2005-10-24T07:07:15Z

Logged In: YES
user_id=1290388

Instead of:
overflowed:

/* spool through remaining characters \*/

while ((c = Py_CHARMASK(*str)) != '\\0')

	str ++;

Shoold be
	while ((c = Py_CHARMASK(*str)) != '\0') {

		c = digitlookup[c];

	if (c \< 0 || c \>= base) /* non-"digit" character \*/

		break;

	str++;
}

And why not
static int digitlookup[] = {

37, 37, 37 ......

};
and
if (c >= base) break;

alanmcintyre · 2005-10-24T15:26:18Z

Logged In: YES
user_id=1115903

Thanks, funny_falcon - that corrected the problem with the
literals. I also included the change to the digit lookup
table.

The new patch is attached as python-mystrtoul4.tgz; it
passes all tests now on my machine.

tim-one · 2005-11-01T00:55:20Z

Logged In: YES
user_id=31435

This looks pretty good to me -- talk about every trick in the
book <wink>.

Note that the digitlimit vector is too cautious for bases that
are powers of 2. For example, it's obvious that any string of
32 bits can't overflow an unsigned long, but the table cuts
base 2 off at 31 instead. The formula should use log(2**32,
base) instead:

"N digits can't overflow" iff
base**N-1 < 2**32 iff
base**N < 2**32+1
base**N <= 2**32 iff
N <= log(2**32, base)

Assuming exact calculation of log(2**32, base) then
(dubious, really), the floor of that is exactly the maximum
safe N.

The power-of-2 bases, and base 10, should be added to the
tests. We really want to check that _all_ supported bases
work, right?

alanmcintyre · 2005-12-21T04:20:13Z

Logged In: YES
user_id=1115903

I cleaned up the digitlimit vector and test cases now
include all bases on [2, 36]. Uploaded as
python-mystrtoul5.tgz.

birkenfeld · 2006-02-19T09:43:22Z

Logged In: YES
user_id=1188172

Tim, if it looks good can it be applied?

tim-one · 2006-05-23T18:49:07Z

Logged In: YES
user_id=31435

Thank you, Alan! A variant of the patch was checked in as
revision 46133 for Python 2.5, affecting

Lib/test/test_builtin.py
Misc/ACKS
Misc/NEWS
Python/mystrtoul.c

For future reference, note that C doesn't define whether an
unqualified "char" is signed or unsigned, and your
assumption that it was signed doesn't actually work
everywhere ;-) Fixing that was the only real pain remaining
here: thank you for the careful work and testing!

alanmcintyre mannequin closed this as completed Oct 24, 2005

alanmcintyre mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Oct 24, 2005

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix for int(string, base) wrong answers (take 2) #42513

Fix for int(string, base) wrong answers (take 2) #42513

alanmcintyre mannequin commented Oct 24, 2005

alanmcintyre mannequin commented Oct 24, 2005

alanmcintyre mannequin commented Oct 24, 2005

funnyfalcon mannequin commented Oct 24, 2005

alanmcintyre mannequin commented Oct 24, 2005

tim-one commented Nov 1, 2005

alanmcintyre mannequin commented Dec 21, 2005

birkenfeld commented Feb 19, 2006

tim-one commented May 23, 2006

Fix for int(string, base) wrong answers (take 2) #42513

Fix for int(string, base) wrong answers (take 2) #42513

Comments

alanmcintyre mannequin commented Oct 24, 2005

alanmcintyre mannequin commented Oct 24, 2005

alanmcintyre mannequin commented Oct 24, 2005

funnyfalcon mannequin commented Oct 24, 2005

alanmcintyre mannequin commented Oct 24, 2005

tim-one commented Nov 1, 2005

alanmcintyre mannequin commented Dec 21, 2005

birkenfeld commented Feb 19, 2006

tim-one commented May 23, 2006