classification
Title: Fix for int(string, base) wrong answers (take 2)
Type: Stage:
Components: Interpreter Core Versions:
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: Nosy List: alanmcintyre, funny_falcon, georg.brandl, tim.peters
Priority: normal Keywords: patch

Created on 2005-10-24 05:33 by alanmcintyre, last changed 2006-05-23 18:49 by tim.peters. This issue is now closed.

Files
File name Uploaded Description Edit
python-mystrtoul3.tgz alanmcintyre, 2005-10-24 05:33 Fix for int(string, base) errors and added tests
python-mystrtoul4.tgz alanmcintyre, 2005-10-24 15:26 Corrected patch to fix problem with literals
python-mystrtoul5.tgz alanmcintyre, 2005-12-21 04:20 Cleaned up digit lookup vector & added tests for base 10 and base 2**x
Messages (8)
msg48895 - (view) Author: Alan McIntyre (alanmcintyre) (Python committer) Date: 2005-10-24 05:33
This incorporates patch #1334979, adds test cases for
all the   cases listed in bug #1334662, and adds test
cases for evaluation of 2**32+1.  There seem to be some
minor speed improvements (simplistic stats shown
below). Some simple performance test scripts have been
included in the attached file as well.

A lookup table was added for the maximum number of
digits that can never overflow on a 32-bit ulong for
each base.  Overflow is only checked when this limit is
exceeded by 1, and once the input string is determined
to be too long (2 over the limit), the evaluation is
halted and an overflow indication is returned.  This
appears to help reduce the evaluation time for very
long strings (no time is wasted trying to evaluate all
of it into a 32-bit ulong).

Evaluation of each character has also been replaced by
a lookup table.  I'm not certain of the amount of speed
benefit obtained from this; I added it early on and
haven't had time to go back and test.  It may be that
it's not worth the extra static table.

Baseline Python from CVS:
alan@tarantula:~/python/dist/src# ./python -m timeit
'int("9")'
100000 loops, best of 3: 4 usec per loop
alan@tarantula:~/python/dist/src# ./python -m timeit
'int("999999999")'
100000 loops, best of 3: 5.49 usec per loop
alan@tarantula:~/python/dist/src# ./python -m timeit
'int("999999999999")'
100000 loops, best of 3: 11.8 usec per loop
alan@tarantula:~/python/dist/src# ./python -m timeit
'int("999999999999999")'
100000 loops, best of 3: 13.4 usec per loop
alan@tarantula:~/python/dist/src# ./python -m timeit
'int("1"*600)'
1000 loops, best of 3: 997 usec per loop


Modified:
alan@tarantula:~/python_testint/dist/src# ./python -m
timeit 'int("9")'
100000 loops, best of 3: 3.63 usec per loop
alan@tarantula:~/python_testint/dist/src# ./python -m
timeit 'int("999999999")'
100000 loops, best of 3: 3.93 usec per loop
alan@tarantula:~/python_testint/dist/src# ./python -m
timeit 'int("999999999999")'  
100000 loops, best of 3: 9.79 usec per loop
alan@tarantula:~/python_testint/dist/src# ./python -m
timeit 'int("999999999999999")'
100000 loops, best of 3: 11 usec per loop
alan@tarantula:~/python_testint/dist/src# ./python -m
timeit 'int("1"*600)'
1000 loops, best of 3: 905 usec per loop

10.2% faster for 1-digit int
39.7% faster for 9-digit int
20.5% faster for 12-digit int
21.8% faster for 15-digit int
10.2% faster for 600-digit int

Test program that takes 750k ints from [0, 2**32)
through stdin:
    Baseline: 8.114 sec (best of 5 consecutive runs)
    Modified: 6.774 sec (best of 5 consecutive runs)

19.8% faster

NOTE: This patch causes new errors in test_array and
test_compile, but it seems that these *should* be
failing given the input string for long(), unless I'm
missing something:

======================================================================
ERROR: test_repr (__main__.FloatTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "Lib/test/test_array.py", line 187, in test_repr
    self.assertEqual(a, eval(repr(a), {"array":
array.array}))
ValueError: invalid literal for long(): 10000000000.0
 
======================================================================
ERROR: test_repr (__main__.DoubleTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "Lib/test/test_array.py", line 187, in test_repr
    self.assertEqual(a, eval(repr(a), {"array":
array.array}))
ValueError: invalid literal for long(): 10000000000.0
 
----------------------------------------------------------------------

test test_compile crashed -- exceptions.ValueError:
invalid literal for long():
90000000000000.
 
msg48896 - (view) Author: Alan McIntyre (alanmcintyre) (Python committer) Date: 2005-10-24 05:41
Logged In: YES 
user_id=1115903

I forgot to add that these results were obtained on a PIIIm
833MHz running Linux 2.4.2, GCC 3.2.2, with the Python 2.5a0
CVS source from about 8pm EST Oct 23, 2005.
msg48897 - (view) Author: funny_falcon (funny_falcon) Date: 2005-10-24 07:07
Logged In: YES 
user_id=1290388

Instead of:
overflowed:

	/* spool through remaining characters */

	while ((c = Py_CHARMASK(*str)) != '\0')

		str ++;

Shoold be
	while ((c = Py_CHARMASK(*str)) != '\0') {

		c = digitlookup[c];

		if (c < 0 || c >= base) /* non-"digit" character */

			break;

		str++;
	}
And why not
static int digitlookup[] = {

	37, 37, 37 ......
};
and
		if (c >= base)  break;
msg48898 - (view) Author: Alan McIntyre (alanmcintyre) (Python committer) Date: 2005-10-24 15:26
Logged In: YES 
user_id=1115903

Thanks, funny_falcon - that corrected the problem with the
literals. I also included the change to the digit lookup
table.  

The new patch is attached as python-mystrtoul4.tgz; it
passes all tests now on my machine.
msg48899 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2005-11-01 00:55
Logged In: YES 
user_id=31435

This looks pretty good to me -- talk about every trick in the 
book <wink>.

Note that the digitlimit vector is too cautious for bases that 
are powers of 2.  For example, it's obvious that any string of 
32 bits can't overflow an unsigned long, but the table cuts 
base 2 off at 31 instead.  The formula should use log(2**32, 
base) instead:

"N digits can't overflow" iff
base**N-1 < 2**32  iff
base**N < 2**32+1
base**N <= 2**32  iff
N <= log(2**32, base)

Assuming exact calculation of log(2**32, base) then 
(dubious, really), the floor of that is exactly the maximum 
safe N.

The power-of-2 bases, and base 10, should be added to the 
tests.  We really want to check that _all_ supported bases 
work, right?
msg48900 - (view) Author: Alan McIntyre (alanmcintyre) (Python committer) Date: 2005-12-21 04:20
Logged In: YES 
user_id=1115903

I cleaned up the digitlimit vector and test cases now
include all bases on [2, 36].  Uploaded as
python-mystrtoul5.tgz.
msg48901 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2006-02-19 09:43
Logged In: YES 
user_id=1188172

Tim, if it looks good can it be applied?
msg48902 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2006-05-23 18:49
Logged In: YES 
user_id=31435

Thank you, Alan!  A variant of the patch was checked in as
revision 46133 for Python 2.5, affecting

Lib/test/test_builtin.py
Misc/ACKS
Misc/NEWS
Python/mystrtoul.c

For future reference, note that C doesn't define whether an
unqualified "char" is signed or unsigned, and your
assumption that it was signed doesn't actually work
everywhere ;-)  Fixing that was the only real pain remaining
here:  thank you for the careful work and testing!
History
Date User Action Args
2005-10-24 05:33:10alanmcintyrecreate