Title: Speed up slice object processing
Type: performance Stage: patch review
Components: Interpreter Core Versions: Python 3.5
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: mark.dickinson, pitrou, scoder, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2013-08-22 20:52 by scoder, last changed 2015-01-29 07:12 by scoder. This issue is now closed.

File name Uploaded Description Edit
faster_PyEval_SliceIndex.patch scoder, 2013-08-22 21:15 review
use_locals_in_PySlice_GetIndicesEx.patch scoder, 2013-08-22 21:22 review
cache_slice_indices.patch scoder, 2013-08-23 06:31 review
cache_slice_indices.patch scoder, 2014-11-16 08:35 updated patch that normalises C step=1 to Python step=None review
Messages (16)
msg195918 - (view) Author: Stefan Behnel (scoder) * Date: 2013-08-22 20:52
The get/set/delitem slicing protocol has replaced the old Py2.x get/set/delslice protocol in Py3.x. This change introduces a substantial overhead due to processing indices as Python objects rather than plain Py_ssize_t values. This overhead should be reduced as much as possible.

This ticket can be seen as a more general follow-up to issue10227.
msg195919 - (view) Author: Stefan Behnel (scoder) * Date: 2013-08-22 20:55
This tiny patch adds a fast-path to _PyEval_SliceIndex() that speeds up the slicing-heavy "fannkuch" benchmark by 8% for me.
msg195920 - (view) Author: Stefan Behnel (scoder) * Date: 2013-08-22 21:15
Sorry, broken patch. Here's a new one (same results).
msg195922 - (view) Author: Stefan Behnel (scoder) * Date: 2013-08-22 21:22
Another patch, originally proposed in issue10227 by Kristján Valur Jónsson. It uses local variables instead of pointers in PySlice_GetIndicesEx(). I see no performance difference whatsoever (Linux/gcc 4.7), but also I can't see a reason not to do this. I think it cleans up the code a bit.
msg195925 - (view) Author: Stefan Behnel (scoder) * Date: 2013-08-22 21:48
And in fact, callgrind shows an *increase* in the number of instructions executed for the "use locals" patch (898M with vs. 846M without that patch when running the fannkuch benchmark twice). That speaks against making that change, I guess.
msg195944 - (view) Author: Stefan Behnel (scoder) * Date: 2013-08-23 06:31
Here is another patch that remembers the Py_ssize_t slice indices if they are known at instantiation time. It only makes a very small difference for the "fannkuch" benchmark, so that's no reason to add both the complexity and the (IMHO ignorable) memory overhead.

However, it also adds a public C-API function PySlice_FromIndices() that allows setting the values from C code at instantiation time, thus avoiding the overhead of having to do the conversion back again.

It might also be worth exploring if we can't instantiate the Python object indices at first request using properties, iff the slice was created with integer indices. Meaning, we'd leave the PyObject* fields NULL in that case until they are being used. That would reduce the overhead of creating the slice object in the first place. Since we added the one-instance cache, it's now dominated by the creation of the index value objects when _PySlice_FromIndices() is being used internally (e.g. when calling PySequence_Get/Set/DelSlice()).
msg231215 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-11-15 20:02
Are there any benchmarks?
msg231219 - (view) Author: Stefan Behnel (scoder) * Date: 2014-11-15 21:32
As mentioned, the fannkuch benchmark benefits quite a bit from the fast path, by 8%. It was a while ago when I tested, though, so I don't have exact numbers ATM.
msg231235 - (view) Author: Stefan Behnel (scoder) * Date: 2014-11-16 08:17
Here's another idea. There could be two implementations of "slice", one that uses Python object indices (as currently) and one that has Py_ssize_t indices  (and properties for the start/stop/step attributes). Similar to what Unicode strings do, the slice type would decide which to use at instantiation time.

The internals of PySliceObject are not part of the stable ABI, and the normal way to figure out the indices is PySlice_GetIndicesEx(), which would be adapted accordingly.

This breaks compatibility with some C extensions that access the slice attributes directly, but that should be rare, given how complex index calculations are.

This change is certainly more invasive than adding Py_ssize_t fields to the existing object, though.
msg231237 - (view) Author: Stefan Behnel (scoder) * Date: 2014-11-16 08:35
Thanks for the review, Serhiy. There isn't really a reason not to 'normalise' the Python step object to None when 1 is passed from C code, so I updated the patch to do that.

Regarding on-demand creation of the Python values, the problem is that C code that accesses the struct fields directly would not be prepared to find NULL values in them, so this would be a visible change and potential source of crashes. However, my guess is that this would rarely be a problem (see my last comment on changing the slice type internally).
msg231238 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-11-16 09:47
> There could be two implementations of "slice", one that
> uses Python object indices (as currently) and one that has Py_ssize_t
> indices  (and properties for the start/stop/step attributes).

Agree, this idea LGTM. Single boolean flag should be enough to switch between 
implementations. This shouldn't break well written code. The PySliceObject 
structure is not a part of stable API.
msg231244 - (view) Author: Stefan Behnel (scoder) * Date: 2014-11-16 10:41
I reran the benchmarks with the fast path in _PyEval_SliceIndex() and the results are not conclusive. There is no clear indication that it improves the performance.

The problem is that most slices have no step, many slices are open ended on at least one side (i.e. usually have one or two fields set to None) and they are only used once, so integer index optimisations can only have a limited impact compared to instantiating the slice object in the first place.
msg231247 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-11-16 14:32
Indeed, PySequence_GetSlice() is used only in few performance non-critical places in the stdlib, PySequence_Set/DelSlice() are not used at all. So looks as this way doesn't speed up any code.

As for faster_PyEval_SliceIndex.patch see issue17576. May be such optimization is not always correct (only for exact PyLong). And why you pass through fast path only if -1<= Py_SIZE(v) <=1?
msg231256 - (view) Author: Stefan Behnel (scoder) * Date: 2014-11-16 18:43
> why you pass through fast path only if -1<= Py_SIZE(v) <=1?

It's usually enough, it's the fastest fast path, and it doesn't even need
error handling.

Any slicing where the slice calculation time matters is going to be of
fairly limited size, often involving the values (!) 1 or -1 in the slice
object fields, but definitely small integers.
msg234884 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-01-28 10:29
So may be close this issue as it doesn't affect performance?
msg234949 - (view) Author: Stefan Behnel (scoder) * Date: 2015-01-29 07:12
Closing as not worth doing.
Date User Action Args
2015-01-29 07:12:55scodersetstatus: open -> closed
resolution: rejected
messages: + msg234949
2015-01-28 10:29:34serhiy.storchakasetmessages: + msg234884
2014-11-16 18:43:12scodersetmessages: + msg231256
2014-11-16 14:32:59serhiy.storchakasetmessages: + msg231247
2014-11-16 10:41:41scodersetmessages: + msg231244
2014-11-16 09:47:57serhiy.storchakasetmessages: + msg231238
2014-11-16 08:35:48scodersetfiles: + cache_slice_indices.patch

messages: + msg231237
2014-11-16 08:17:12scodersetmessages: + msg231235
versions: + Python 3.5, - Python 3.4
2014-11-15 21:32:30scodersetmessages: + msg231219
2014-11-15 20:02:47serhiy.storchakasetmessages: + msg231215
2014-11-15 14:23:53scodersetfiles: - faster_PyEval_SliceIndex.patch
2013-08-23 07:07:02serhiy.storchakasetnosy: + mark.dickinson, serhiy.storchaka

stage: patch review
2013-08-23 06:31:53scodersetfiles: + cache_slice_indices.patch

messages: + msg195944
2013-08-22 21:48:47scodersetmessages: + msg195925
2013-08-22 21:22:44scodersetfiles: + use_locals_in_PySlice_GetIndicesEx.patch

messages: + msg195922
2013-08-22 21:15:56scodersetfiles: + faster_PyEval_SliceIndex.patch

messages: + msg195920
2013-08-22 20:55:14scodersetfiles: + faster_PyEval_SliceIndex.patch
keywords: + patch
messages: + msg195919
2013-08-22 20:52:52scodercreate