Improve performance of MemoryView slicing #54436
In a recent email exchange on python-dev, Antoine Pitrou mentioned that slicing memoryview objects (lazy slices) wasn't necessarily very efficient when dealing with short slices. The data he posted was:

$ ./python -m timeit -s "x = b'x'*10000" "x[:100]"
10000000 loops, best of 3: 0.134 usec per loop
$ ./python -m timeit -s "x = memoryview(b'x'*10000)" "x[:100]"
10000000 loops, best of 3: 0.151 usec per loop

Actually, this is not a fair comparison. A more realistic alternative to the memoryview is the bytearray, a mutable buffer. My local tests gave these numbers:

python.exe -m timeit -n 10000000 -s "x = ((b'x'*10000))" "x[:100]"
python.exe -m timeit -n 10000000 -s "x = (bytearray(b'x'*10000))" "x[:100]"
python.exe -m timeit -n 10000000 -s "x = memoryview(bytearray(b'x'*10000))" "x[:100]"

In this case, lazy slicing is indeed faster than greedy slicing. However, I was intrigued by how much these cases differ. Why was slicing bytes objects so much faster? Each should just result in the creation of a single object. It turns out that the slicing operation for strings (and sequences in general) is very streamlined in the core. To address this to some extent, I provide a patch with three main components:

Applying this patch provides the following figures:

python.exe -m timeit -n 10000000 -s "x = (bytearray(b'x'*10000))" "x[:100]"
python.exe -m timeit -n 10000000 -s "x = memoryview(bytearray(b'x'*10000))" "x[:100]"

In memoryobject.c there was a comment stating that there should be an API for this. Now there is, though only an internal one. |
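As an aside, the lazy-versus-greedy distinction discussed above can be illustrated from pure Python (a minimal sketch, independent of the patch itself):

```python
# A small demonstration (not the patch) of why memoryview slicing is "lazy":
# the slice shares the underlying buffer instead of copying it.
data = bytearray(b'x' * 10000)

greedy = bytes(data)[:100]     # copies 100 bytes into a new object
lazy = memoryview(data)[:100]  # no copy; a new view onto the same buffer

# Mutating the original buffer is visible through the lazy slice...
data[0] = ord(b'y')
assert lazy[0] == ord(b'y')
# ...but not through the greedy copy, which was taken earlier.
assert greedy[0] == ord(b'x')
```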
You forgot to attach your patch. |
Oh dear. Here it is. |
But then, perhaps implementing the sequence protocol for memoryviews might be more efficient still. |
The sequence protocol (if I'm not confused) only works with a PyObject ** array. |
As an additional point: the PyMemoryObject has a "base" member that I think is redundant; the "view.obj" member should be sufficient. |
Yes, that's what I think as well. |
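The redundancy can be observed from the Python side, where `memoryview.obj` already exposes the exporting object, even for sliced sub-views (a small sketch, not part of the patch):

```python
# A memoryview keeps a reference to its underlying exporter via .obj --
# the attribute that a separate C-level "base" member would duplicate.
buf = bytearray(b'abcdef')
view = memoryview(buf)
sub = view[2:4]

assert view.obj is buf          # the view references the exporting object
assert sub.obj is buf           # a sliced view references the same exporter
assert bytes(sub) == b'cd'
```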
In 2.x, strings are sliced using PySequence_GetSlice(). ceval.c in 3.0 is different; there is no apply_slice there (despite comments to that effect). I'd have to take another look with the profiler to figure out how bytes slicing in 3.0 works, but I suspect that it is somehow fast-tracked past the creation of slice objects, etc. |
I don't think it is fast-tracked at all. |
Well then, it's back to the profiler for 3.2. I did all of the profiling with 2.7 for practical reasons (it was the only version I had available at the time) and then ported the change to 3.2 today. But obviously there are different rules in 3.2 :) |
I find it a lot easier to appreciate patches that implement a single change than those that mix different changes. There are three different things in your patch, which I would like to see in at least three different commits. I'd be happy if you could separate the changes into more readable feature patches. That makes it easier to accept them. I'm generally happy about the slice changes, but you will have to benchmark the equivalent changes in Py3.2 to prove that they are similarly worth applying there. |
The benchmarks are from 3.2 |
In case I'm not clear enough: |
I've extracted and fixed the part of this patch that implements the slice object cache. In particular, PySlice_Fini() was incorrectly implemented. This patch applies cleanly for me against the latest py3k branch. |
Any benchmark numbers for the slice cache? |
I ran the list tests in pybench and got this:

Test    minimum run-time    average run-time

Repeating this gave me anything between 1.5% and 3.5% in total, with >2% for the small lists benchmark (which is the expected best case, as slicing large lists obviously dominates the slice object creation). IMHO, even 2% would be pretty good for such a small change.
In any case, the ref-count needs to be re-initialised to 1. A call to _Py_NewReference() would be enough, though, following the example in listobject.c. So you can replace

    PyObject_INIT(obj, &PySlice_Type);

by

    _Py_NewReference((PyObject *)obj);

in the patch. New patch attached. |
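The general free-list pattern being discussed (allocate from a small cache of recently freed objects instead of the heap) can be sketched in Python; `PooledSlice`, `acquire` and `release` are hypothetical names for illustration only:

```python
# A minimal, hypothetical sketch of the free-list pattern the patch applies
# to slice objects in C: instead of allocating a fresh object every time,
# recycle recently freed ones from a small cache.
_FREE_LIST = []
_MAX_FREE = 80  # cap chosen arbitrarily for this sketch

class PooledSlice:
    __slots__ = ("start", "stop", "step")

def acquire(start, stop, step=None):
    # Reuse a cached object if available (the C code additionally
    # re-initialises the refcount with _Py_NewReference at this point).
    obj = _FREE_LIST.pop() if _FREE_LIST else PooledSlice()
    obj.start, obj.stop, obj.step = start, stop, step
    return obj

def release(obj):
    # Cache the freed object for reuse, up to the cap; beyond that,
    # simply drop it and let normal deallocation reclaim it.
    if len(_FREE_LIST) < _MAX_FREE:
        _FREE_LIST.append(obj)

s1 = acquire(0, 100)
release(s1)
s2 = acquire(5, 10)
assert s2 is s1  # the freed object was recycled, no new allocation
```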
Well, 3% on such micro-benchmarks (and, I assume, 0% on the rest) is not really an impressive speedup. |
Don't you also need a _Py_ForgetReference() at the other end? Or have I missed something? |
There's a "PyObject_Del(obj)" in all code paths. |
Here are some real micro benchmarks (note that the pybench benchmarks actually do lots of other stuff besides slicing):

base line:

$ ./python -m timeit -s 'l = list(range(100)); s=slice(None)' 'l[s]'
1000000 loops, best of 3: 0.464 usec per loop
$ ./python -m timeit -s 'l = list(range(10)); s=slice(None)' 'l[s]'
10000000 loops, best of 3: 0.149 usec per loop
$ ./python -m timeit -s 'l = list(range(10)); s=slice(None,1)' 'l[s]'
10000000 loops, best of 3: 0.135 usec per loop

patched:

$ ./python -m timeit -s 'l = list(range(100))' 'l[:1]'
10000000 loops, best of 3: 0.158 usec per loop
$ ./python -m timeit -s 'l = list(range(100))' 'l[:]'
1000000 loops, best of 3: 0.49 usec per loop
$ ./python -m timeit -s 'l = list(range(100))' 'l[1:]'
1000000 loops, best of 3: 0.487 usec per loop
$ ./python -m timeit -s 'l = list(range(100))' 'l[1:3]'
10000000 loops, best of 3: 0.184 usec per loop
$ ./python -m timeit -s 'l = list(range(10))' 'l[:]'
10000000 loops, best of 3: 0.185 usec per loop
$ ./python -m timeit -s 'l = list(range(10))' 'l[1:]'
10000000 loops, best of 3: 0.181 usec per loop

original:

$ ./python -m timeit -s 'l = list(range(100))' 'l[:1]'
10000000 loops, best of 3: 0.171 usec per loop
$ ./python -m timeit -s 'l = list(range(100))' 'l[:]'
1000000 loops, best of 3: 0.499 usec per loop
$ ./python -m timeit -s 'l = list(range(100))' 'l[1:]'
1000000 loops, best of 3: 0.509 usec per loop
$ ./python -m timeit -s 'l = list(range(100))' 'l[1:3]'
10000000 loops, best of 3: 0.198 usec per loop
$ ./python -m timeit -s 'l = list(range(10))' 'l[:]'
10000000 loops, best of 3: 0.188 usec per loop
$ ./python -m timeit -s 'l = list(range(10))' 'l[1:]'
10000000 loops, best of 3: 0.196 usec per loop

So the maximum impact seems to be 8% for very short slices (<10), and it quickly goes down for longer slices, where the copy impact clearly dominates. There's still some 2% for 100 items, though.

I find it interesting that the base line is way below the other timings. That makes me think it's actually worth caching constant slice instances, as CPython already does for tuples. Cython also caches both now. I would expect that constant slices like [:], [1:] or [:-1] are extremely common. As you can see above, caching them could speed up slicing by up to 30% for short lists, and still some 7% for a list of length 100.

Stefan |
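The observation that a constant slice expression builds a fresh slice object on every evaluation can be checked from Python by intercepting `__getitem__` (a sketch; `Probe` is a hypothetical helper, not from the patch):

```python
# Record the slice objects that a constant slice expression produces at
# runtime by intercepting __getitem__.
class Probe:
    def __init__(self):
        self.seen = []
    def __getitem__(self, key):
        self.seen.append(key)  # keep the slice object alive for inspection
        return None

p = Probe()
for _ in range(3):
    p[1:3]  # the same constant slice expression, evaluated three times

# Every evaluation passed an equal slice(1, 3); on interpreters without a
# constant-slice cache these are typically three distinct objects, which is
# exactly the allocation overhead that caching would avoid.
assert len(p.seen) == 3
assert all(s == slice(1, 3) for s in p.seen)
```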
Here's another base line test: slicing an empty list.

patched:

$ ./python -m timeit -s 'l = []' 'l[:]'
10000000 loops, best of 3: 0.0847 usec per loop

original:

$ ./python -m timeit -s 'l = []' 'l[:]'
10000000 loops, best of 3: 0.0977 usec per loop

That's about 13% less overhead. |
Indeed. I have never touched it, but I suppose it needs an upgrade of |
... which is exactly what this slice caching patch is there for. ;-) |
A quick test against the py3k stdlib:

find -name "*.py" | while read file; do egrep '\[[-0-9]:[-0-9]\]' "$file"; done | wc -l

This finds 2096 lines in 393 files. |
Created follow-up bpo-11107 for caching constant slice objects. |
Kristján, could you check out the new implementation over at bpo-10181?

Slicing:
Slicing with overhead for multidimensional capabilities:
Slice assignment, overhead for multidimensional capabilities:
_testbuffer.c: 0.469 usec
tolist:
tolist, struct module overhead:
_testbuffer.c: 1.38 msec (yes, that's milliseconds!) |
I'm afraid I had put this matter _far_ out of my head :) Seeing the amount of discussion on that other defect (stuff I had already come across and scratched my head over), I think there is a lot of catching up that I'd need to do, and I am unable to give this any priority at the moment.

python.exe -m timeit -n 10000000 -s "x = ((b'x'*10000))" "x[:100]"
python.exe -m timeit -n 10000000 -s "x = (bytearray(b'x'*10000))" "x[:100]"

Did you take a look at this at all? |
I see. I thought this was mainly about memoryview performance, so:

C:\Users\stefan\hg\pep-3118\PCbuild>amd64\python.exe -m timeit -n 10000000 -s "x = ((b'x'*10000))" "x[:100]"
C:\Users\stefan\hg\pep-3118\PCbuild>amd64\python.exe -m timeit -n 10000000 -s "x = (bytearray(b'x'*10000))" "x[:100]"
C:\Users\stefan\hg\pep-3118\PCbuild>amd64\python.exe -m timeit -n 10000000 -s "x = memoryview(bytearray(b'x'*10000))" "x[:100]"

Linux:
bytes: 10.9 usec
bytearray: 0.14 usec
memoryview: 0.14 usec |
With Stefan Behnel's slice-object-cache.patch, I get this (PEP-3118 branch):

Linux:
bytes: 0.097 usec
bytearray: 0.127 usec
memoryview: 0.12 usec

On Linux, that's quite a nice speedup. |
Updated single slice caching patch for latest Py3.3 hg tip. |
New changeset fa2f8dd077e0 by Antoine Pitrou in branch 'default': |
Thanks Stefan. I'm leaving the issue open since the original topic is a bit different. |
Kristján, I ran the benchmarks from http://bugs.python.org/issue10227#msg143731. I also ran the profile-guided optimization build for Visual Studio. If you can reproduce similar results, I think we can close this issue.

./python -m timeit -n 10000000 -s "x = ((b'x'*10000))" "x[:100]"
linux-cpython (4244e4348362): 0.102 usec
windows-cpython: 0.109 usec

./python -m timeit -n 10000000 -s "x = (bytearray(b'x'*10000))" "x[:100]"
linux-cpython (4244e4348362): 0.107 usec
windows-cpython: 0.127 usec

./python -m timeit -n 10000000 -s "x = memoryview(bytearray(b'x'*10000))" "x[:100]"
linux-cpython (4244e4348362): 0.127 usec
windows-cpython: 0.145 usec |
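The command-line measurements in this thread can also be reproduced from within Python via the timeit module (a sketch; absolute numbers will of course differ per machine):

```python
# Reproduce the bytes / bytearray / memoryview slicing comparison from this
# thread using the timeit module directly instead of the command line.
import timeit

def bench(stmt, setup, number=100000):
    """Best-of-three time per statement execution, in seconds."""
    return min(timeit.repeat(stmt, setup=setup, number=number, repeat=3)) / number

cases = {
    "bytes":      "x = b'x' * 10000",
    "bytearray":  "x = bytearray(b'x' * 10000)",
    "memoryview": "x = memoryview(bytearray(b'x' * 10000))",
}

for name, setup in cases.items():
    usec = bench("x[:100]", setup) * 1e6
    print(f"{name:>10}: {usec:.3f} usec per loop")
```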
Sure. Flagging this as fixed. Can't close it until 10181 is closed due to some dependency thing. (Perhaps someone else knows what to do?) |
Great. I removed the dependency since it's fixed in both cpython and the pep-3118 branch. |