Title: Avoid unnecessary copies for list, set, and bytearray ops.
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.8
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: arigo, brandtbucher, josh.r, pablogsal, rhettinger, serhiy.storchaka, tim.peters
Priority: normal Keywords: patch

Created on 2019-03-07 21:39 by brandtbucher, last changed 2019-03-14 05:04 by rhettinger. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 12226 closed brandtbucher, 2019-03-07 21:40
Messages (18)
msg337442 - (view) Author: Brandt Bucher (brandtbucher) * Date: 2019-03-07 21:39
Binary operations on collections are, in general, of quadratic complexity. However, we can sometimes operate in-place if we know that we hold the only reference to the object. This allows us to avoid making many intermediate copies when summing many lists (or dicts ;), taking the union of many sets, or working with literals.

The attached PR adds a simple 2-line refcount check which delegates to the corresponding in-place operation for: list_concat, list_repeat, set_and, set_or, set_xor, set_sub, bytearray_concat, and bytearray_repeat.
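To make the target concrete, here is the kind of chained pattern the refcount check is aimed at (a plain Python illustration of the quadratic behavior, not the patch itself):

```python
# Summing n lists with `+` copies the accumulated result at every step,
# so total work is quadratic in the number of additions. The proposed
# check lets the intermediate result (refcount exactly 1) be extended
# in place instead of copied.
lists = [[i] for i in range(5)]

total = []
for lst in lists:
    total = total + lst  # each `+` builds a brand-new list today

assert total == [0, 1, 2, 3, 4]
```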
msg337727 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-12 08:46
This is an interesting idea. But I have two concerns.

1. It is hard to implement refcount-based optimizations on Python implementations which do not use reference counting (e.g. PyPy). If the effect of this optimization is significant, people will start writing code that depends on it, and this will cause problems on other implementations.

2. Currently list1 + list2 + list3 returns a list which allocates the exact amount of memory needed to contain its content. But with the proposed changes the result list could preallocate more memory. If the result is a long-living object, this can lead to wasted memory.
msg337780 - (view) Author: Brandt Bucher (brandtbucher) * Date: 2019-03-12 17:21
Thanks for the input, Serhiy. On point (1):

It's a valid concern... but I don't think it's necessarily a good enough reason to withhold a simple, yet *significant* performance increase for a common use case.

This reminds me of past CPython implementation details, such as ordered dictionaries in 3.6. Some users may errantly adopt problematic idioms that aren't strictly portable between implementations, but if you ask me the performance improvement (again, from quadratic to linear) for three common built-in types (I admit I'm being a bit generous to bytearray here ;) is well worth it.

Regarding (2): 

I believe that CPython lists only overallocate by something like 1/8, which in my opinion is a fairly conservative value. If the lists are large enough for this small preallocation to be a real memory burden, I would imagine that making the extra copies would be even worse.

And, if you ask me, list overallocation is a feature, not a bug!
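The overallocation Brandt mentions can be observed from pure Python with sys.getsizeof (the exact growth factor is a CPython implementation detail and may vary between versions):

```python
import sys

a = []
sizes = []
for i in range(1000):
    a.append(i)
    sizes.append(sys.getsizeof(a))

# The reported size changes only at occasional resize points, showing that
# append() overallocates capacity rather than growing one slot at a time.
resizes = sum(1 for prev, cur in zip(sizes, sizes[1:]) if cur != prev)
assert 0 < resizes < 100
```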
msg337803 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-03-12 20:39
I intend to make a python-dev post about this so we can decide whether to move forward or not.   For now, here are a few thoughts:

* Something like this was implemented for strings as an effort to mitigate the harm from a common mistake of using chained addition to build up a string rather than using str.join().  IIRC, there was a decision at that time not to do something similar for other containers.

* FWIW, the non-operator versions of these operations already support passing in multiple arguments and are the preferred way to do it.

* I think this PR would help some code in the wild.  But typically, people know that building a series of new containers will tend to make copies.  It would be surprising if it didn't.

* We really don't want people to rely on this optimization. It is very fragile. If someone ever adds an extra reference, the chained operations become quadratic again.

Right now, I'm -0 on the change.
msg337804 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-03-12 20:45
> Binary operations on collections are, in general, of quadratic complexity.

BTW, this seems like a drastic mischaracterization, making it seem like something is wrong with the containers.  It is more accurate to say that an uncommon usage pattern of repeated copy-and-add operations on containers can give quadratic behavior.

Likewise, it is a mischaracterization to say this PR "fixes" the containers.  Instead, it uses what is arguably a hack to recognize potential cases where the copy step can be skipped, even though a copy is what the user explicitly specified should occur.

Armin, do you remember the outcome of the discussions when you added the string optimization for repeated concatenation?  I do remember that we told users not to rely on it and that str.join() was still the recommended best practice.
msg337810 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-03-12 21:17
I'm at best -0 on this, and on the edge of -1.  We already supply mutate-in-place operations for when this is wanted, and, e.g., for lists, it's as dead easy to spell as

    L1 += L2

Half the point of the hack for string catenation was that the similar-looking

    S1 += S2

did _not_ mutate in place.  But that spelling for lists already does.
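The distinction Tim draws is directly observable (a small illustration; `alias` and `salias` are names introduced here for the demo):

```python
L1 = [1, 2]
alias = L1
L1 += [3]              # list += extends the existing object in place
assert alias == [1, 2, 3] and alias is L1

S1 = "ab"
salias = S1
S1 += "cd"             # str += must rebind S1 to a new object
assert salias == "ab" and salias is not S1
```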

Beyond that, is it actually safe for set operations?  Those need to compare elements, so can call into Python code via custom __eq__ implementations, which in turn can do anything, including boosting the number of ways to reach the original inputs (i.e., increase their refcounts).  For that reason I believe it can be visible if, e.g.,

    Set1 = Set1 & Set2

mutates the original set bound to Set1.  The binding via name `Set1` may be the only way to reach the original set before that statement executes, but the process of computing the RHS _may_ create additional references to the original set object.

Even if it doesn't, when element __eq__s are running other threads may run too, and pick up ever-changing values for the set object `Set1` refers to if it's mutated in place.

None of those were potential issues for hacking string catenation (which never calls back into Python code - and holds the GIL - for the duration).
msg337857 - (view) Author: Brandt Bucher (brandtbucher) * Date: 2019-03-13 16:12
I'm sorry, Raymond. I didn't mean to imply that this is a problem that needs fixing. I just saw an opportunity for a cheap, effective performance boost for a potentially expensive operation. Thanks for clarifying.

And on users "expecting" copies: many advanced Python programmers I know still recently thought that "a < b < c" was evaluated as "(a < b) < c". It's nice to be pleasantly surprised with "better" behavior... especially when this pretty solidly falls under the category of "implementation detail", since it's not outwardly visible.

And speaking of visibility:

> Beyond that, is it actually safe for set operations?

I'm not sure how it wouldn't be. None of the sets in your example should be able to be mutated, since the operation you described doesn't hit the new branch at all.

I've also spent parts of yesterday and today trying to craft malicious objects/threads that do what you described, and haven't been able to. The main roadblock I've found is that the named references still aren't mutated, and the first (and only) intermediate copy is invisible to everything except PyNumber_Add. I also haven't been able to get a handle to the left-hand set in a subsequent "+" operation from a contained object. Maybe I misunderstood you, though.

However, both of you are much better engineers than I am, so I'll defer to your judgement here. If the set portion of the PR seems vulnerable, I'm eager to modify or remove it.
msg337858 - (view) Author: Brandt Bucher (brandtbucher) * Date: 2019-03-13 16:16
> a subsequent "+" operation...

Sorry, I meant "&|-" here.
msg337859 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-03-13 16:31
Brandt, it's quite possible I have no real idea what this patch is aimed at.  I didn't reverse-engineer the code, the title is extremely broad, and there are no Python examples I can find for what it _is_ aimed at.  If, e.g., it's impossible that this patch has any effect on what happens for

    Set1 = Set1 & Set2

then the concerns I raised may well be irrelevant.
msg337862 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-13 16:50
It could be aimed at list1 + list2 + list3. Or foo() + list2, where foo() returns a new list. Or other examples where the left operand is a new object.

But it is not correct to say that it is a problem that these operations currently have quadratic complexity. They have complexity O(N^2) where N is the number of additions. Unless you have expressions with many thousands of additions, this is not a problem (actually, you can get a crash just trying to compile such an expression in CPython).
msg337864 - (view) Author: Brandt Bucher (brandtbucher) * Date: 2019-03-13 16:57
I apologize - I should have been clearer about what this accomplishes. I'll use list addition as an example, but this works for any binary operation on these containers.

If the left operand has a refcount of exactly one, it will simply mutate in-place rather than making an unnecessary copy. The simplest example of this is with literals:

[0] + list1

This just modifies the literal [0] in-place rather than creating an unnecessary copy. However, the more common use case is adding named lists. Currently, when adding list1 + list2 + list3, two copies (with one reference each) are made:

- The intermediate result of list1 + list2.
- The sum of this intermediate result and list3.

Only the second of these is actually returned - the first is used once and thrown away. The effect of this patch is to only create *at most one* (instead of n-1) copy for any arbitrarily long list summation, as this intermediate result will just mutate in-place for lists 3-n.
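For illustration, here are the two copies made by today's behavior, alongside the already-available linear-time spelling (`tmp` and `out` are names introduced here for the demo):

```python
list1, list2, list3 = [1], [2], [3]

tmp = list1 + list2      # intermediate copy; used once, then discarded
result = tmp + list3     # second copy; the only one actually returned
assert result == [1, 2, 3]
assert tmp is not list1 and result is not tmp

# Equivalent linear-time idiom available today:
out = list(list1)
out.extend(list2)
out.extend(list3)
assert out == result
```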
msg337869 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-03-13 18:51
Serhiy, if I understand this, it _could_ pay big with even just a few operations.  For example,

    [0] * 1000000 + [1] + [2] + [3]

As-is, we'll copy the million zeroes three times.  With the patch, more likely the original vector of a million zeroes would be extended in-place, three times with a single element each time.

But I don't know that people do that in real code.  It would be great to find real examples in real code that would benefit.
msg337882 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2019-03-13 22:57
This patch is based on the following reasoning: if 'a' is a list and the reference count of 'a' is equal to 1, then we can mutate in-place 'a' in a call to 'a->ob_type->tp_as_sequence->list_concat'.  Typically that is called from 'PyNumber_Add(a, b)'.  The patch is only correct for the case where PyNumber_Add() is called from Python/ceval.c.  It is clearly wrong if you consider calls to PyNumber_Add() from random C extension modules.  Some extension modules' authors would be very surprised if the following code starts giving nonsense:

    PyObject *a = PyList_New(0);
    PyObject *b = PyNumber_Add(a, some_other_list);
    /* here, OF COURSE a must still be an empty list and b != a */

By comparison, if you consider the hack that I'm guilty for doing long ago to improve string concatenation, you'll see that it is done entirely inside ceval.c, and not in stringobject.c or unicodeobject.c.

For this reason I consider the whole patch, as written now, as bogus.
msg337883 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2019-03-13 23:02
...or PySequence_Concat() instead of PyNumber_Add(); same reasoning.
msg337884 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2019-03-13 23:25
Raymond said: "FWIW, the non-operator versions of these operations already support passing in multiple arguments and is the preferred way to do it."

True for sets with union/intersection/difference (and their inplace equivalents), but not for lists and bytearrays, which don't have an out-of-place version, and the in-place version, extend, only takes a single iterable argument.

Perhaps extend() could take multiple iterables as positional varargs, not just a single iterable, to make the multi-extend case work efficiently?

Not sure how often this actually comes up though; maybe I've just internalized good practices, but I only rarely find myself combining containers where the left-most operand is an unreferenced temporary, or combining many containers at once in a single line (I use itertools.chain wrapped in a constructor, or literals with the unpacking generalization for most of those cases).
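The suggested varargs extend() could be approximated today with itertools.chain; `extend_all` below is an illustrative name, not a real or proposed API:

```python
from itertools import chain

def extend_all(target, *iterables):
    # Hypothetical helper: extend `target` with each iterable in turn,
    # in a single pass with no intermediate containers.
    target.extend(chain.from_iterable(iterables))
    return target

combined = extend_all([0], [1, 2], (3,), range(4, 6))
assert combined == [0, 1, 2, 3, 4, 5]
```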
msg337886 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2019-03-13 23:44
I believe Tim Peters's concern with:

    Set1 = Set1 & Set2

is misplaced, because he's drawn an analogy with the string concatenation optimization, which *does* handle:

    Str1 = Str1 + Str2

because the optimization is in the ceval loop and can check the subsequent instruction for a "simple store", clearing the target's reference if the target and the left operand are the same thing. This change doesn't try to handle this case, so the only time the optimization takes place is when the only reference to the left hand operand is on the stack (not in any named variable at all).

Tim's concern *would* make it unsafe to extend this patch to apply the target clearing optimization, because at global scope, clearing the target would be observable from __hash__/__eq__ (and from other threads); str is safe because it holds the GIL the whole time, and can't invoke arbitrary Python level code that might release it, set doesn't have that guarantee.
msg337888 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-03-14 00:00
Josh, I agree - but do read Armin's response.  The string hack was done inside ceval.c, where we have extraordinary knowledge of context.  Adding similar hacks inside Python C API functions seems to be a non-starter - they can't magically start mutating arguments based on refcounts alone.  Armin's dead-simple example is compelling.

So if this is desirable at all, it would need to get hairier.
msg337894 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-03-14 05:04
Everyone, thanks for chiming in.  Will mark this as closed/rejected.  For the reasons listed in this thread, it isn't a sensible path for us (hairy at best, flat-out wrong at worst).  To the extent people ever have a quadratic behavior problem, we already have explicit and clean ways to solve it with the existing API.
Date User Action Args
2019-03-14 05:04:31rhettingersetstatus: open -> closed
resolution: rejected
messages: + msg337894

stage: patch review -> resolved
2019-03-14 00:00:22tim.peterssetmessages: + msg337888
2019-03-13 23:44:52josh.rsetmessages: + msg337886
2019-03-13 23:25:40josh.rsetnosy: + josh.r
messages: + msg337884
2019-03-13 23:02:28arigosetmessages: + msg337883
2019-03-13 22:57:54arigosetmessages: + msg337882
2019-03-13 18:51:19tim.peterssetmessages: + msg337869
2019-03-13 17:13:16brandtbuchersettitle: Linear-time list, set, and bytearray ops. -> Avoid unnecessary copies for list, set, and bytearray ops.
2019-03-13 16:57:42brandtbuchersetmessages: + msg337864
2019-03-13 16:50:49serhiy.storchakasetmessages: + msg337862
2019-03-13 16:31:27tim.peterssetmessages: + msg337859
2019-03-13 16:16:24brandtbuchersetmessages: + msg337858
2019-03-13 16:12:26brandtbuchersetmessages: + msg337857
2019-03-12 21:17:04tim.peterssetnosy: + tim.peters
messages: + msg337810
2019-03-12 20:45:26rhettingersetnosy: + arigo
messages: + msg337804
2019-03-12 20:39:53rhettingersetmessages: + msg337803
2019-03-12 17:21:00brandtbuchersetmessages: + msg337780
2019-03-12 08:46:18serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg337727
2019-03-12 02:41:42pablogsalsetnosy: + pablogsal
2019-03-11 23:20:59brandtbuchersettitle: Linear-time ops for some mutable collections. -> Linear-time list, set, and bytearray ops.
2019-03-09 05:42:55terry.reedysetnosy: + rhettinger
2019-03-07 21:40:08brandtbuchersetkeywords: + patch
stage: patch review
pull_requests: + pull_request12216
2019-03-07 21:39:51brandtbuchercreate