classification
Title: sum() several times slower on Python 3 64-bit
Type: performance Stage: needs patch
Components: Interpreter Core Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: lukasz.langa, mark.dickinson, rhettinger, scoder, serhiy.storchaka, steven.daprano
Priority: normal Keywords: patch

Created on 2015-04-29 18:36 by lukasz.langa, last changed 2018-08-12 12:44 by pitrou.

Files
File name Uploaded Description Edit
pylong_freelist.patch scoder, 2015-05-01 14:02 review
unpack_single_digits.patch scoder, 2018-08-12 12:36
Messages (15)
msg242238 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2015-04-29 18:36
I got a report that summing numbers is noticably slower on Python 3. This is easily reproducible:


$ time python2.7 -c "print sum(xrange(3, 10**9, 3)) + sum(xrange(5, 10**9, 5)) - sum(xrange(15, 10**9, 15))"
233333333166666668

real    0m6.165s
user    0m6.100s
sys     0m0.032s


$ time python3.4 -c "print(sum(range(3, 10**9, 3)) + sum(range(5, 10**9, 5)) - sum(range(15, 10**9, 15)))"
233333333166666668

real    0m16.413s
user    0m16.086s
sys     0m0.089s


I can't tell from initial poking what's the core issue here. Both examples produce equivalent bytecode, the builtin_sum() function is only noticably different in the fact that it uses PyLong_* across the board, including PyLong_AsLongAndOverlow. We'll need to profile this, which I didn't have time for yet.
msg242241 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-04-29 19:02
Can't reproduce on 32-bit Linux.

$ time python2.7 -c "print sum(xrange(3, 10**9, 3)) + sum(xrange(5, 10**9, 5)) - sum(xrange(15, 10**9, 15))"
233333333166666668

real    1m11.614s
user    1m11.376s
sys     0m0.056s
$ time python3.4 -c "print(sum(range(3, 10**9, 3)) + sum(range(5, 10**9, 5)) - sum(range(15, 10**9, 15)))"
233333333166666668

real    1m11.658s
user    1m10.980s
sys     0m0.572s

$ python2.7 -m timeit -n1 -r1 "sum(xrange(3, 10**9, 3)) + sum(xrange(5, 10**9, 5)) - sum(xrange(15, 10**9, 15))"
1 loops, best of 1: 72 sec per loop
$ python3.4 -m timeit -n1 -r1 "sum(range(3, 10**9, 3)) + sum(range(5, 10**9, 5)) - sum(range(15, 10**9, 15))"
1 loops, best of 1: 72.5 sec per loop

$ python2.7 -m timeit -s "a = list(range(10**6))" -- "sum(a)"
10 loops, best of 3: 114 msec per loop
$ python3.4 -m timeit -s "a = list(range(10**6))" -- "sum(a)"
10 loops, best of 3: 83.5 msec per loop

What is sys.int_info on your build?
msg242242 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-04-29 19:04
I reproduce under 64-bit Linux. So this may be because the Python long digit (30 bits) is smaller than the C long (64 bits).

Lukasz: is there a specific use case? Note you can use Numpy for such calculations.
msg242243 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2015-04-29 19:23
Serhiy, this is 64-bit specific. Antoine, as far as I can tell, the main use case is: "Don't make it look like migrating to Python 3 is a terrible performance downgrade."

As we discussed on the language summit this year [1], we have to be at least not worse to look appealing. This might be a flawed benchmark but people will make them anyway. In this particular case, there's internal usage at Twitter that unearthed it. The example is just a simplified repro.

Some perf degradations were expected, like switching text to Unicode. In this case, the end result computed by both 2.7 and 3.4 is the same so we should be able to address this.

[1] http://lwn.net/Articles/640224/
msg242244 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-04-29 19:34
If that's due to the different representation of Python 2's int type and Python 3's int type then I don't see an easy solution to this.
msg242259 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2015-04-30 03:50
Łukasz: there are three ingredients here - sum, (x)range and the integer addition that sum will be performing at each iteration.  Is there any chance you can separate the effects on your machine?

On my machine (OS X, 64-bit), I'm seeing *some* speed difference in the integer arithmetic, but not enough to explain the whole of the timing mismatch.

One thing we've lost in Python 3 is the fast path for small-int addition *inside* the ceval loop.  It may be possible to restore something there.
msg242260 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2015-04-30 04:08
Throwing out sum, I'm seeing significant slowdown simply from xrange versus range:

taniyama:Desktop mdickinson$ python2 -m timeit -s 'x = xrange(3, 10**9, 3)' 'for e in x: pass'
10 loops, best of 3: 5.01 sec per loop
taniyama:Desktop mdickinson$ python3 -m timeit -s 'x = range(3, 10**9, 3)' 'for e in x: pass'
10 loops, best of 3: 8.62 sec per loop
msg242262 - (view) Author: Stefan Behnel (scoder) * Date: 2015-04-30 05:30
> there are three ingredients here - sum, (x)range and the integer addition that sum will be performing at each iteration.

... not to forget the interpreter startup time on his machine. :)

I did a tiny bit of profiling and about 90% of the time seems to be spent creating and deallocating throw-away PyLong objects. My guess is that it simply lacks a free-list in _PyLong_New().
msg242294 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-04-30 23:35
It seems we (like the benchmarks posted) are spending a whole lot of time on something that's probably not relevant to any real-world situation.

If someone has actual code that suffers from this, it would be good to know about it.
(note by the way that summing on a range() can be done O(1): it's just a variation on a arithmetic series)
msg242300 - (view) Author: Stefan Behnel (scoder) * Date: 2015-05-01 06:09
I don't think it's irrelevant. Throw-away integers are really not uncommon. For-loops use them quite often, non-trivial arithmetic expressions can create a lot of intermediate temporaries. Speeding up the create-delete cycle of PyLong sounds like a very obvious thing to do.

Imagine some code that iterates over a list of integers, applies some calculation to them, and then stores them in a new list, maybe even using a list comprehension or so. If you could speed up the intermediate calculation by avoiding overhead in creating temporary PyLong objects, such code could benefit a lot.

I suspect that adding a free-list for single-digit PyLong objects (the most common case) would provide some visible benefit.
msg242302 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-05-01 10:44
Le 01/05/2015 08:09, Stefan Behnel a écrit :
> 
> I don't think it's irrelevant. Throw-away integers are really not
uncommon. For-loops use them quite often, non-trivial arithmetic
expressions can create a lot of intermediate temporaries. Speeding up
the create-delete cycle of PyLong sounds like a very obvious thing to do.

That may be a good thing indeed. I'm just saying that the benchmarks
people are worried about here are completely pointless.
msg242310 - (view) Author: Stefan Behnel (scoder) * Date: 2015-05-01 14:02
I tried implementing a freelist. Patch attached, mostly adapted from the one in dictobject.c, but certainly needs a bit of cleanup.

The results are not bad, about 10-20% faster:

Original:

$ ./python -m timeit 'sum(range(1, 100000))'
1000 loops, best of 3: 1.86 msec per loop

$ ./python -m timeit -s 'l = list(range(1000, 10000))' '[(i*2+5) // 7 for i in l]'
1000 loops, best of 3: 1.05 msec per loop


With freelist:

$ ./python -m timeit 'sum(range(1, 100000))'
1000 loops, best of 3: 1.52 msec per loop

$ ./python -m timeit -s 'l = list(range(1000, 10000))' '[(i*2+5) // 7 for i in l]'
1000 loops, best of 3: 931 usec per loop
msg242357 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2015-05-01 23:46
Antoine asked: 
> If someone has actual code that suffers from this, it would be good to know about it.

You might have missed Łukasz' earlier comment: "In this particular case, there's internal usage at Twitter that unearthed it. The example is just a simplified repro."
msg242914 - (view) Author: Stefan Behnel (scoder) * Date: 2015-05-11 20:08
Issue 24165 was created to pursue the path of a free-list for PyLong objects.
msg323443 - (view) Author: Stefan Behnel (scoder) * Date: 2018-08-12 12:36
FWIW, a PGO build of Py3.7 is now about 20% *faster* here than my Ubuntu 16/04 system Python 2.7, and for some (probably unrelated) reason, the system Python 3.5 is another 2% faster on my side.

IMHO, the only other thing that seems obvious to try would be to inline the unpacking of single digit PyLongs into sum(). I attached a simple patch that does that, in case someone wants to test it out. For non-PGO builds, it's about 17% faster for me. Didn't take the time to benchmark PGO builds with it.
History
Date User Action Args
2018-08-12 12:44:11pitrousetnosy: - pitrou
2018-08-12 12:36:17scodersetfiles: + unpack_single_digits.patch

messages: + msg323443
versions: - Python 3.5
2017-04-11 08:16:30louielusettitle: sum() several times slower on Python 3 -> sum() several times slower on Python 3 64-bit
2017-04-11 08:14:25louielusetversions: + Python 3.7
2015-05-11 20:08:55scodersetmessages: + msg242914
2015-05-01 23:46:19steven.dapranosetnosy: + steven.daprano
messages: + msg242357
2015-05-01 14:02:21scodersetfiles: + pylong_freelist.patch
keywords: + patch
messages: + msg242310
2015-05-01 10:44:15pitrousetmessages: + msg242302
2015-05-01 06:09:58scodersetmessages: + msg242300
2015-04-30 23:35:11pitrousetmessages: + msg242294
2015-04-30 05:30:12scodersetnosy: + scoder
messages: + msg242262
2015-04-30 04:08:30mark.dickinsonsetmessages: + msg242260
2015-04-30 03:50:27mark.dickinsonsetnosy: + mark.dickinson
messages: + msg242259
2015-04-29 19:34:25pitrousetmessages: + msg242244
versions: - Python 3.4
2015-04-29 19:23:39lukasz.langasetmessages: + msg242243
2015-04-29 19:04:22pitrousetmessages: + msg242242
2015-04-29 19:02:03serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg242241
2015-04-29 18:36:58lukasz.langacreate