Message 119872 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	kristjan.jonsson
Recipients	kristjan.jonsson, pitrou
Date	2010-10-29.08:50:25
SpamBayes Score	6.2345533e-09
Marked as misclassified	No
Message-id	<1288342228.97.0.170566139041.issue10227@psf.upfronthosting.co.za>
In-reply-to

Content
In a recent email exchange on python-dev, Antoine Pitrou mentioned that slicing memoryview objects (lazy slices) wasn't necessarily very efficient when dealing with short slices. The data he posted was: $ ./python -m timeit -s "x = b'x'10000" "x[:100]" 10000000 loops, best of 3: 0.134 usec per loop $ ./python -m timeit -s "x = memoryview(b'x'10000)" "x[:100]" 10000000 loops, best of 3: 0.151 usec per loop Actually, this is not a fair comparison. A more realistic alternative to the memoryview is the bytearray, a mutable buffer. My local tests gave these numbers: python.exe -m timeit -n 10000000 -s "x = ((b'x'10000))" "x[:100]" 10000000 loops, best of 3: 0.14 usec per loop python.exe -m timeit -n 10000000 -s "x = (bytearray(b'x'10000))" "x[:100]" 10000000 loops, best of 3: 0.215 usec per loop python.exe -m timeit -n 10000000 -s "x = memoryview(bytearray(b'x'10000))" "x[:100]" 10000000 loops, best of 3: 0.163 usec per loop In this case, lazy slicing is indeed faster than greedy slicing. However, I was intrigued by how much these cases differ. Why was slicing bytes objects so much faster? Each should just result in the generation of a single object. It turns out that the slicing operation for strings (and sequences is very streamlined in the core. To address this to some extent I provide a patch with three main components: 1) There is now a single object cache of slice objects. These are generated by the core when slicing and immediately released. Reusing them if possible is very beneficial. 2) The PySlice_GetIndicesEx couldn't be optimized because of aliasing. Fixing that function sped it up considerably. 3) Creating a new api to create a memory view from a base memory view and a slice is much faster. The old way would do two copies of a Py_buffer with adverse effects on cache performance. Applying this patch provides the following figures: python.exe -m timeit -n 10000000 -s "x = ((b'x'10000))" "x[:100]" 10000000 loops, best of 3: 0.125 usec per loop python.exe -m timeit -n 10000000 -s "x = (bytearray(b'x'10000))" "x[:100]" 10000000 loops, best of 3: 0.202 usec per loop python.exe -m timeit -n 10000000 -s "x = memoryview(bytearray(b'x'10000))" "x[:100]" 10000000 loops, best of 3: 0.138 usec per loop in memoryobject.c there was a comment stating that there should be an API for this. Now there is, only internal.

In a recent email exchange on python-dev, Antoine Pitrou mentioned that slicing memoryview objects (lazy slices) wasn't necessarily very efficient when dealing with short slices.  The data he posted was:


$ ./python -m timeit -s "x = b'x'*10000" "x[:100]"
10000000 loops, best of 3: 0.134 usec per loop
$ ./python -m timeit -s "x = memoryview(b'x'*10000)" "x[:100]"
10000000 loops, best of 3: 0.151 usec per loop

Actually, this is not a fair comparison.  A more realistic alternative to the memoryview is the bytearray, a mutable buffer.  My local tests gave these numbers:

python.exe -m timeit -n 10000000 -s "x = ((b'x'*10000))" "x[:100]"
10000000 loops, best of 3: 0.14 usec per loop

python.exe -m timeit -n 10000000 -s "x = (bytearray(b'x'*10000))" "x[:100]"
10000000 loops, best of 3: 0.215 usec per loop

python.exe -m timeit -n 10000000 -s "x = memoryview(bytearray(b'x'*10000))" "x[:100]"
10000000 loops, best of 3: 0.163 usec per loop

In this case, lazy slicing is indeed faster than greedy slicing.  However, I was intrigued by how much these cases differ.  Why was slicing bytes objects so much faster?  Each should just result in the generation of a single object.

It turns out that the slicing operation for strings (and sequences is very streamlined in the core.  To address this to some extent I provide a patch with three main components:

1) There is now a single object cache of slice objects.  These are generated by the core when slicing and immediately released.  Reusing them if possible is very beneficial.
2) The PySlice_GetIndicesEx couldn't be optimized because of aliasing.  Fixing that function sped it up considerably.
3) Creating a new api to create a memory view from a base memory view and a slice is much faster.  The old way would do two copies of a Py_buffer with adverse effects on cache performance.

Applying this patch provides the following figures:
python.exe -m timeit -n 10000000 -s "x = ((b'x'*10000))" "x[:100]"
10000000 loops, best of 3: 0.125 usec per loop

python.exe -m timeit -n 10000000 -s "x = (bytearray(b'x'*10000))" "x[:100]"
10000000 loops, best of 3: 0.202 usec per loop

python.exe -m timeit -n 10000000 -s "x = memoryview(bytearray(b'x'*10000))" "x[:100]"
10000000 loops, best of 3: 0.138 usec per loop

in memoryobject.c there was a comment stating that there should be an API for this.  Now there is, only internal.

History
Date	User	Action	Args
2010-10-29 08:50:29	kristjan.jonsson	set	recipients: + kristjan.jonsson, pitrou
2010-10-29 08:50:28	kristjan.jonsson	set	messageid: <1288342228.97.0.170566139041.issue10227@psf.upfronthosting.co.za>
2010-10-29 08:50:27	kristjan.jonsson	link	issue10227 messages
2010-10-29 08:50:25	kristjan.jonsson	create