classification
Title: [patch] Fast sum() for non-numbers
Type: performance Stage:
Components: Interpreter Core Versions: Python 3.5
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: gvanrossum Nosy List: Ramchandra Apte, Sergey, aleax, ethan.furman, gvanrossum, jcea, oscarbenjamin, rhettinger, serhiy.storchaka, terry.reedy
Priority: normal Keywords: patch

Created on 2013-06-26 06:33 by Sergey, last changed 2014-12-09 18:11 by gvanrossum. This issue is now closed.

Files
File name Uploaded Description Edit
fastsum.patch Sergey, 2013-06-26 06:33
fastsum-special.patch Sergey, 2013-07-04 09:19
fastsum-special-tuplesandlists.patch Sergey, 2013-07-11 23:57 special case for lists and tuples (python 2.7 and 3.3)
fastsum-iadd_warning.patch Sergey, 2013-07-12 18:47 Warn about difference between __add__ and __iadd__
fasttuple.py Sergey, 2013-07-14 12:56 fasttuple-0.1.py
Messages (20)
msg191896 - (view) Author: Sergey (Sergey) Date: 2013-06-26 06:33
Problem
=======
Code:
  sum([[1,2,3]]*1000000, [])
takes forever to complete.

Suggestion
==========
Patch sum() function so that it did not created 1000000 copies of result, but created just one. Attached patch does that.

Before patch:
$ ./python -mtimeit --setup="x=[[1,2,3]]*10000" "sum(x,[])"
10 loops, best of 3: 915 msec per loop

After patch:
$ ./python -mtimeit --setup="x=[[1,2,3]]*10000" "sum(x,[])"
1000 loops, best of 3: 469 usec per loop

200000% boost! :)

Details
=======
Built-in sum function could look like this:
  def sum(seq, start = 0):
    for item in seq:
      start += item
    return start

But that would be bad, becaust in cases like:
  empty = []
  result = sum(list_of_lists, empty)
content of "empty" would be modified.

So instead it looks like this:
  def sum(seq, start = 0):
    for item in seq:
      start = start + item
    return start
it creates a copy of the entire partial result on EVERY "start + item".

While instead it could look like this:
  def sum(seq, start = 0):
    start = start + seq[0:1]
    for item in seq[1:]:
      start += item
    return start
create just ONE copy, and use it.

That's what attached patch is doing.

An alternative is something like this:
  def sum(seq, start = 0):
    start = copy.copy(start)
    for item in seq:
      start += item
    return start
But I'm not sure how to make a copy of arbitrary object yet.
msg192013 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-06-28 19:32
Performance enhancements do not normally go in bugfix releases. The issue of quadratic performance of sum(sequences, null_seq) is known, which is why the doc says that sum() is for numbers and recommends .join for strings and itertools.chain for other sequences.

sum([[1,2,3]]*n, []) == [1,2,3]*n == list(chain.from_iterable([[1,2,3]]*n))

For n = 1000000, the second takes a blink of an eye and the third under a second. So there is no issue for properly written code. More generally,

sum(listlist, []) == list(chain.from_iterable(listlist))

The latter should be comparable in speed to your patch and has the advantage of not turning the iterator into a concrete list unless and until one actually needs a concrete list.

There are two disadvantages to doing the equivalent within sum:

1. People *will* move code that depends on the internal optimization to pythons that do not have it. And they *will* complain about their program 'freezing'. This already happened when the equivalent of str.join was built into sum. It is better to use 'proper' code that will work well enough on all CPython versions and other implementations.

2. It discourages people from carefully thinking about whether they actually need a concrete list or merely the iterator for a virtual list. The latter work for sequences that are too long to fit in memory.

So my inclination is to reject the change.
msg192014 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-06-28 19:47
I agree with Terry. CPython deliberately disallow use sum() with lists of strings.
msg192025 - (view) Author: Sergey (Sergey) Date: 2013-06-28 23:52
> The issue of quadratic performance of sum(sequences, null_seq) is known

I hope it's never too late to fix some bugs... :)

> sum([[1,2,3]]*n, []) == [1,2,3]*n == list(chain.from_iterable([[1,2,3]]*n))

But if you already have a list of lists, and you need to join the lists together you have only two of those:
1. sum(list_of_lists, [])
2. list(chain.from_iterable(list_of_lists))
And using sum is much more obvious than using itertools, that most people may not (and don't have to) even know about.

When someone, not a python-guru, just thinks about that, she would think "so, I'll just add lists together, let's write a for-loop... Oh, wait, that's what sum() does, it adds things, and python is dynamic-type, sum() should work for everything". That's how I was thinking, that's how most people would think, I guess...

I was very surprised to find out about that bug.

> 1. People *will* move code that depends on the internal optimization to pythons that do not have it.

Looks like this bug is CPython-specific, others (Jython, IronPython...) don't have it, so people will move code that depends on the internal optimization to other pythons that DO have it. :)

> 2. It discourages people from carefully thinking about whether they actually need a concrete list or merely the iterator for a virtual list.

Hm... Currently people can also use iterator for sum() or list for itertools. Nothing changed...

> I agree with Terry. CPython deliberately disallow use sum() with lists of strings.

Isn't it exactly because of this bug? I mean, if this bug gets fixed, sum would be as fast as join, or maybe even faster, right? So the string restriction can be dropped later. But that would be a separate bugreport. Anyway, the bug is there not just for strings, it also happens for lists, or for any other non-numeric objects that can be added.

PS: I was ready that my patch may not get accepted, and I'm actually thinking on another way of doing that (just don't know how to get a copy of arbitrary PyObject in C yet). But I thought that the idea itself is great: finally making sum() fast without any trade-offs, what could be better? Patch works at least for 2.7, 3.3, hg-tip and can be easily ported to any other version. I have not expected to get such a cold shoulder. :(
msg192071 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-06-30 14:21
> Looks like this bug is CPython-specific, others (Jython, IronPython...) don't have it, so people will move code that depends on the internal optimization to other pythons that DO have it. :)

I don't know about IronPython, but Jython and PyPy have same behavior as CPython.

Even if the proposed implementation returns same result for common types, it may have different behavior for user types. Such changes should not do in bugfix releases, this is a new feature. I think that it is worthwhile to discuss the idea at first in the Python-Ideas mailing list [1]. But I suspect that most core developers will agree with Terry.

[1] http://mail.python.org/mailman/listinfo/python-ideas
msg192197 - (view) Author: Ramchandra Apte (Ramchandra Apte) * Date: 2013-07-02 14:16
I agree with Sergey. Would be nice if it could use copy.copy (though that might change behaviour a bit)
msg192215 - (view) Author: Sergey (Sergey) Date: 2013-07-02 18:14
> I don't know about IronPython, but Jython and PyPy have same behavior as CPython.

Right. I was wrong, at least Jython and PyPy suffer from this bug too.

> I think that it is worthwhile to discuss the idea at first in the Python-Ideas mailing list [1].

Ok, wrote to Python-Ideas, thank you.
msg192281 - (view) Author: Sergey (Sergey) Date: 2013-07-04 09:19
This patch implements another solution to the same bug. But instead of changing default code it adds a special case optimization for lists, tuples and strings, similar to those two special cases for numbers, that are already there.

=== Lists ===
No patch:
  $ ./python -mtimeit --setup="x=[[1,2,3]]*10000" "sum(x,[])"
  10 loops, best of 3: 885 msec per loop
fastsum.patch:
  $ ./python -mtimeit --setup="x=[[1,2,3]]*10000" "sum(x,[])"
  1000 loops, best of 3: 524 usec per loop
fastsum-special.patch:
  $ ./python -mtimeit --setup="x=[[1,2,3]]*10000" "sum(x,[])"
  1000 loops, best of 3: 298 usec per loop
Result: 3000 times faster.

=== Tuples ===
No patch:
  $ ./python -mtimeit --setup="x=[(1,2,3)]*10000" "sum(x,())"
  10 loops, best of 3: 585 msec per loop
fastsum.patch:
  $ ./python -mtimeit --setup="x=[(1,2,3)]*10000" "sum(x,())"
  10 loops, best of 3: 585 msec per loop
fastsum-special.patch:
  $ ./python -mtimeit --setup="x=[(1,2,3)]*10000" "sum(x,())"
  1000 loops, best of 3: 536 usec per loop
Result: 1000 times faster.

=== Strings ===
No patch (just string check removed):
  $ ./python -mtimeit --setup="x=['abc']*100000" "sum(x,'')"
  10 loops, best of 3: 1.52 sec per loop
fastsum.patch (+ string check removed):
  $ ./python -mtimeit --setup="x=['abc']*100000" "sum(x,'')"
  10 loops, best of 3: 1.52 sec per loop
fastsum-special.patch
  $ ./python -mtimeit --setup="x=['abc']*100000" "sum(x,'')"
  10 loops, best of 3: 27.8 msec per loop
join:
  $ ./python -mtimeit --setup="x=['abc']*100000" "''.join(x)"
  1000 loops, best of 3: 1.66 msec per loop
Result: 50 times faster, but still constantly slower than join.

NB:
The string part of this patch is a Proof-of-Concept, just for tests, to demonstrate, that sum can really be O(n) for strings. It was written for python 2.7.5, won't work for python3, and is not expected to be applied to python2, as it changes the behavior of sum, allowing it to sum strings.

But! If string part is stripped from the patch it works for both python2 and python3, and does not change existing behavior, except making sum faster.
msg192873 - (view) Author: Oscar Benjamin (oscarbenjamin) * Date: 2013-07-11 14:39
This "optimisation" is a semantic change. It breaks backward compatibility in cases where a = a + b and a += b do not result in the name a having the same value. In particular this breaks backward compatibility for numpy users.

Numpy arrays treat += differently from + in the sense that a += b coerces b to the same dtype as a and then adds in place whereas a + b uses Python style type promotion. This behaviour is by design and it is useful. It is also entirely appropriate (unlike e.g. summing lists) that someone would use sum() to add numpy arrays.

An example where + and += give different results:

>>> from numpy import array
>>> a1 = array([1, 2, 3], dtype=int)
>>> a1
array([1, 2, 3])
>>> a2 = array([.5, .5, .5], dtype=float)
>>> a2
array([ 0.5,  0.5,  0.5])
>>> a1 + a2
array([ 1.5,  2.5,  3.5])
>>> a1 += a2
>>> a1
array([1, 2, 3])
msg192874 - (view) Author: Ethan Furman (ethan.furman) * (Python committer) Date: 2013-07-11 15:21
So making this a speed-up for generic objects using __iadd__ is out.

The only question remaining is: is it worth special casing a few specific objects (list, tuple, str) to optimise performance?

The str question has already been answered:  it's special cased to raise an error.
msg192919 - (view) Author: Sergey (Sergey) Date: 2013-07-11 23:57
Steven D'Aprano noticed that there's an obscure case of:

>>> class A(list):
...     def __add__(self, other):
...         return A(super(A,self).__add__(other))
...     def __radd__(self, other):
...         return A(other) + self
...

Where:
>>> type( [1] + A([2]) )
<class '__main__.A'>

Is different from:

>>> type( [1].__add__(A([2])) )
<class 'list'>

To keep such an undefined behavior unchanged I updated the fastsum-special-tuplesandlists.patch to have a more strict type check.

In case somebody would like to test it, the patch is attached. It should work for both Python 2.7 and Python 3.3, and should introduce no behavior change.
msg192956 - (view) Author: Sergey (Sergey) Date: 2013-07-12 18:47
> This "optimisation" is a semantic change. It breaks backward
> compatibility in cases where a = a + b and a += b do not result
> in the name a having the same value. In particular this breaks
> backward compatibility for numpy users.

I didn't knew that. Then I guess original fastsum.patch can't be used. Since this is not the first time when someone suggests to use __add__+__iadd__ in sum, I suggest to extend existing warning with your example so that future developers would not be tempted to follow the same approach.

Patch fastsum-iadd_warning.patch attached and can be applied to 2.7.5, 3.3.2 and hg-tip.

Apart from this patch there're still 3 options remaining (special case in sum() for some types; general interface for sequence-like types; individual optimisation for individual types), that are to be discussed yet. Example patch of special case for lists and tuples attached.
msg193048 - (view) Author: Sergey (Sergey) Date: 2013-07-14 12:56
Attached fasttuple.py is a Proof-of-Concept implementation of tuple, that reuses same data storage when possible. Its possible usage looks similar to built-in tuples:
  from fasttuple import ft
  a = ft([1,2])
  b = a + ft([3,4])
  c = b + ft([5,6])
  d = b + ft([7,8])
  d += ft([9])
  d = ft([0]) + d + ft([0])
  print(a, b, c, d)

An interesting side-effect of this implementation is a faster __add__ operator:

Python 2.7.5:
  Adding 100000 of fasttuples
  took 0.23242688179 seconds
  Adding 100000 of built-in tuples
  took 25.2749021053 seconds

Python 3.3.2:
  Adding 100000 of fasttuples
  took 0.2883174419403076 seconds
  Adding 100000 of built-in tuples
  took 25.487935066223145 seconds

(see test() function in fasttuple.py)

This is just a proof of concept, it can be improved in different ways. Similar optimization can be applied to lists.
msg193049 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-07-14 13:33
I guess such implementation of tuple will increase memory usage and creation time which are critical important for tuples.
msg193052 - (view) Author: Sergey (Sergey) Date: 2013-07-14 14:49
> I guess such implementation of tuple will increase memory usage
> and creation time which are critical important for tuples.

On the contrary, it will reduce memory usage and creation time compared to regular tuples, because in cases like:
  c = a + b
you do not have to spend time and memory for allocating and copying elements of "a".

The only case when it could use more memory is if you explicitly delete "c" after that operation. But this can be solved too, internal storage can be resized to a smaller value when its tail elements are not used any more.

This idea can be improved in many ways. For example it's possible to implement __add__ in C so that it would require no additional memory at all. But it is just a proof of concept, and I was trying to keep it simple, so the idea was easier to understand.
msg193081 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-07-15 08:11
> On the contrary, it will reduce memory usage and creation time compared to regular tuples, because in cases like:
>   c = a + b
> you do not have to spend time and memory for allocating and copying elements of "a".

This is not a common case. A common case is creating short tuples and keeping a lot of tuples in memory.

> The only case when it could use more memory is if you explicitly delete "c" after that operation. But this can be solved too, internal storage can be resized to a smaller value when its tail elements are not used any more.

No. For fast += you need keep not only a size of tuple, but also a size of of allocated memory. It's a cause of sys.getsizeof([1, 2]) > sys.getsizeof((1, 2)).

For fast + you need even more complicated internal structure.

Tuples should be compact and fast. You shouldn't optimize a rare case at the cost of regression in common usage.
msg193146 - (view) Author: Sergey (Sergey) Date: 2013-07-16 02:07
> This is not a common case. A common case is creating short tuples and keeping a lot of tuples in memory.

> For fast += you need keep not only a size of tuple, but also a size of allocated memory. It's a cause of sys.getsizeof([1, 2]) > sys.getsizeof((1, 2)).

Agree. But is it worth worrying about? How much RAM could be taken by it? E.g. python test uses ~30000 tuples in peak. So if we add e.g. 32 bytes to tuple then python test would use 106MB of RAM instead of 105MB. 1%? ;)

> You shouldn't optimize a rare case at the cost of regression in common usage.

This could optimize many cases, like instant tuple copy, or instant conversions of lists to tuples and back, if this idea is extended to lists. It may also work for strings.
msg232288 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-12-07 23:30
Sergey, please stop calling the current documented behavior a bug. https://docs.python.org/3/library/functions.html#sum says 'The iterable‘s items are normally numbers ... To concatenate a series of iterables, consider using itertools.chain().

To make a change, there must be discussion and approval on python-ideas, including by Guido.  I believe that GvR's rejection of optimizing for strings covers optimizing for other sequences.  If you want to continue, make a pep-like post there that summarizes the discussion here and makes your current proposal.
msg232351 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-12-09 08:18
Guido, do you care to close this one?

Terry, has channeled you and thinks it may be against your wishes to turn sum() into a general purpose sequence concatenator.   I also seem to recall you expresses similar thoughts back when Alex Martelli first designed sum() many years ago.

If it helps, here is the guidance currently expressed in the docs for __builtin__.sum():

"""
For some use cases, there are good alternatives to sum(). The preferred, fast way to concatenate a sequence of strings is by calling ''.join(sequence). To add floating point values with extended precision, see math.fsum(). To concatenate a series of iterables, consider using itertools.chain().
"""
msg232397 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2014-12-09 18:11
The example code is a perversion and should not be encouraged.
History
Date User Action Args
2014-12-09 18:11:30gvanrossumsetstatus: open -> closed
resolution: rejected
messages: + msg232397
2014-12-09 08:18:51rhettingersetassignee: gvanrossum

messages: + msg232351
nosy: + gvanrossum
2014-12-07 23:30:34terry.reedysetmessages: + msg232288
versions: + Python 3.5, - Python 3.4
2013-07-16 02:07:55Sergeysetmessages: + msg193146
2013-07-15 08:12:06serhiy.storchakasetnosy: + rhettinger
2013-07-15 08:11:00serhiy.storchakasetmessages: + msg193081
2013-07-14 14:49:11Sergeysetmessages: + msg193052
2013-07-14 13:33:59serhiy.storchakasetmessages: + msg193049
2013-07-14 12:56:11Sergeysetfiles: + fasttuple.py

messages: + msg193048
2013-07-12 18:47:28Sergeysetfiles: + fastsum-iadd_warning.patch

messages: + msg192956
2013-07-11 23:57:52Sergeysetfiles: + fastsum-special-tuplesandlists.patch

messages: + msg192919
2013-07-11 15:21:56ethan.furmansetmessages: + msg192874
2013-07-11 14:39:23oscarbenjaminsetnosy: + oscarbenjamin
messages: + msg192873
2013-07-10 21:55:54ethan.furmansetnosy: + ethan.furman
2013-07-04 09:19:39Sergeysetfiles: + fastsum-special.patch

messages: + msg192281
2013-07-02 18:14:17Sergeysetmessages: + msg192215
2013-07-02 14:16:02Ramchandra Aptesetnosy: + Ramchandra Apte
messages: + msg192197
2013-06-30 14:21:56serhiy.storchakasetnosy: + aleax

messages: + msg192071
versions: + Python 3.4, - Python 2.7
2013-06-28 23:52:21Sergeysetmessages: + msg192025
versions: + Python 2.7, - Python 3.4
2013-06-28 19:47:05serhiy.storchakasetmessages: + msg192014
2013-06-28 19:33:00terry.reedysetnosy: + terry.reedy

messages: + msg192013
versions: + Python 3.4, - Python 2.7
2013-06-26 21:08:34vstinnersetnosy: + serhiy.storchaka
2013-06-26 12:23:27jceasetnosy: + jcea
2013-06-26 06:33:59Sergeycreate