classification
Title: Long object free list optimization
Type: performance Stage:
Components: Interpreter Core Versions: Python 3.0
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: gregory.p.smith Nosy List: christian.heimes, gregory.p.smith, pitrou
Priority: normal Keywords: patch

Created on 2008-02-05 03:07 by christian.heimes, last changed 2008-08-18 04:56 by gregory.p.smith. This issue is now closed.

Files
File name Uploaded Description Edit
long_freelist.patch christian.heimes, 2008-02-05 03:07
py3k_longfreelist.patch christian.heimes, 2008-02-06 17:09
py3k_longfreelist2.patch pitrou, 2008-02-12 22:19
py3k_longfreelist2-gps01.patch gregory.p.smith, 2008-03-22 19:54 improvement over longfreelist2
py3k_longfreelist2-gps02.patch gregory.p.smith, 2008-03-23 01:24
issue2013-benchmarks-gps02.txt gregory.p.smith, 2008-03-23 02:04 benchmarks of the -gps02 patch
py3k_longfreelist2_gps03.patch gregory.p.smith, 2008-08-18 04:56 updated, now works in debug mode.
py3k_longfreelist2_gps04.patch gregory.p.smith, 2008-08-18 04:56 resimplified for only 1 digit longs
Messages (13)
msg62060 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-02-05 03:07
The patch adds a free list of long objects with size 1 or -1. It has
quite a tremendous effect on code like list(range(1000)).

$ ./python -m timeit "for i in range(100): list(range(1000))"

Without patch:
10 loops, best of 3: 79 msec per loop

With patch:
10 loops, best of 3: 20.8 msec per loop
msg62106 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-02-06 17:09
The updated patch limits the free list.
msg62338 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-02-12 22:19
I don't get the same impressive speedup as you do, although it's still
significant.

$ ./python -m timeit "for i in range(100): list(range(1000))"
Without patch:
100 loops, best of 3: 5.05 msec per loop
With patch:
100 loops, best of 3: 3.57 msec per loop

Also, your patch is leaky. I'm attaching a fixed version (for py3k,
didn't check the trunk version).
msg62339 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-02-12 22:32
Christian, maybe you did your measurements with the DEBUG flag turned
on? That would explain the discrepancy.

Also, I'm not sure the patch is useful for 2.x since long objects with
size -1 or 1 should be represented as ints instead :-)
msg62342 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-02-12 22:48
Antoine Pitrou wrote:
> Christian, maybe you did your measurements with the DEBUG flag turned
> on? That would explain the discrepancy.
> 
> Also, I'm not sure the patch is useful for 2.x since long objects with
> size -1 or 1 should be represented as ints instead :-)

Yes, I've used a Py_Debug build to measure the speed difference.

You are right. The patch makes no sense for the 2.x series.

Christian
msg64331 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-03-22 16:45
Here are some updated timings against the current py3k branch:

$ ./python -m timeit "sum(range(10000))"

Without patch:
1000 loops, best of 3: 675 usec per loop
With patch:
1000 loops, best of 3: 462 usec per loop

$ ./python -m timeit -s "n=1000000" "sum(range(n, n+10000))"

Without patch:
1000 loops, best of 3: 1.36 msec per loop
With patch:
1000 loops, best of 3: 1.38 msec per loop

$ ./python -m timeit "min(range(255))"

Without patch:
10000 loops, best of 3: 18.7 usec per loop
With patch:
10000 loops, best of 3: 19.4 usec per loop

As you can see the patch makes things quite a bit better for 1-digit
long objects, and there is only a small slowdown for longer or tiny
ints. Given that 1-digit long objects should be prevalent in most code I
think it's probably a winner overall.
msg64337 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-03-22 19:54
I'm attaching a slightly improved version of the longfreelist2 patch. 
It moves the important test to be first and removes the unneeded ? :
conditional.

........................
Here are some more benchmarks:
........................

--- issue2013-without-patch.txt	2008-03-22 12:18:37.000000000 -0700
+++ issue2013-longfreelist2-gps01-size-first.txt	2008-03-22
12:34:44.000000000 -0700
@@ -1,43 +1,43 @@
-3.0a3+ (py3k:61746M, Mar 22 2008, 12:12:29) 
+3.0a3+ (py3k:61746M, Mar 22 2008, 12:33:47) 
 [GCC 4.0.1 (Apple Computer, Inc. build 5367)]
 
 ./python.exe -m timeit min(range(255))
 
-100000 loops, best of 3: 17.2 usec per loop
+100000 loops, best of 3: 15.8 usec per loop
 
 ./python.exe -m timeit -s n=1000000 sum(range(n,n+10000))
 
-1000 loops, best of 3: 1.21 msec per loop
+1000 loops, best of 3: 1.22 msec per loop
 
 ./python.exe -m timeit sum(range(-32768,32768))
 
-100 loops, best of 3: 3.98 msec per loop
+100 loops, best of 3: 2.63 msec per loop
 
 ./python.exe -m timeit sum(range(-1000000,1000000))
 
-10 loops, best of 3: 296 msec per loop
+10 loops, best of 3: 273 msec per loop
 
 ./python.exe -m timeit sum(range(4000))
 
-1000 loops, best of 3: 239 usec per loop
+10000 loops, best of 3: 151 usec per loop
 
 ./python.exe -m timeit sum(list(range(4000)))
 
-1000 loops, best of 3: 360 usec per loop
+1000 loops, best of 3: 274 usec per loop
 
 ./python.exe -m timeit sum(range(10000))
 
-1000 loops, best of 3: 615 usec per loop
+1000 loops, best of 3: 382 usec per loop
 
 ./python.exe -m timeit sum(list(range(10000)))
 
-1000 loops, best of 3: 887 usec per loop
+1000 loops, best of 3: 802 usec per loop
 
 ./python.exe -m timeit sum(range(40000))
 
-100 loops, best of 3: 2.55 msec per loop
+100 loops, best of 3: 1.79 msec per loop
 
 ./python.exe -m timeit sum(range(80000))
 
-100 loops, best of 3: 6.4 msec per loop
+100 loops, best of 3: 5.77 msec per loop
 

........................
And a benchmarks of the longfreelist2 patch before my improvement:
........................

--- issue2013-without-patch.txt	2008-03-22 12:18:37.000000000 -0700
+++ issue2013-longfreelist2.txt	2008-03-22 12:19:46.000000000 -0700
@@ -1,43 +1,43 @@
-3.0a3+ (py3k:61746M, Mar 22 2008, 12:12:29) 
+3.0a3+ (py3k:61746M, Mar 22 2008, 12:18:57) 
 [GCC 4.0.1 (Apple Computer, Inc. build 5367)]
 
 ./python.exe -m timeit min(range(255))
 
-100000 loops, best of 3: 17.2 usec per loop
+100000 loops, best of 3: 16.1 usec per loop
 
 ./python.exe -m timeit -s n=1000000 sum(range(n,n+10000))
 
-1000 loops, best of 3: 1.21 msec per loop
+1000 loops, best of 3: 1.25 msec per loop
 
 ./python.exe -m timeit sum(range(-32768,32768))
 
-100 loops, best of 3: 3.98 msec per loop
+100 loops, best of 3: 2.95 msec per loop
 
 ./python.exe -m timeit sum(range(-1000000,1000000))
 
-10 loops, best of 3: 296 msec per loop
+10 loops, best of 3: 283 msec per loop
 
 ./python.exe -m timeit sum(range(4000))
 
-1000 loops, best of 3: 239 usec per loop
+1000 loops, best of 3: 177 usec per loop
 
 ./python.exe -m timeit sum(list(range(4000)))
 
-1000 loops, best of 3: 360 usec per loop
+1000 loops, best of 3: 276 usec per loop
 
 ./python.exe -m timeit sum(range(10000))
 
-1000 loops, best of 3: 615 usec per loop
+1000 loops, best of 3: 453 usec per loop
 
 ./python.exe -m timeit sum(list(range(10000)))
 
-1000 loops, best of 3: 887 usec per loop
+1000 loops, best of 3: 792 usec per loop
 
 ./python.exe -m timeit sum(range(40000))
 
-100 loops, best of 3: 2.55 msec per loop
+100 loops, best of 3: 2.03 msec per loop
 
 ./python.exe -m timeit sum(range(80000))
 
-100 loops, best of 3: 6.4 msec per loop
+100 loops, best of 3: 6.01 msec per loop
 

........................
And diffs of longfreelist2 vs longfreelist2-gps01:
........................
--- issue2013-longfreelist2.txt	2008-03-22 12:19:46.000000000 -0700
+++ issue2013-longfreelist2-gps01-size-first.txt	2008-03-22
12:34:44.000000000 -0700
@@ -1,43 +1,43 @@
-3.0a3+ (py3k:61746M, Mar 22 2008, 12:18:57) 
+3.0a3+ (py3k:61746M, Mar 22 2008, 12:33:47) 
 [GCC 4.0.1 (Apple Computer, Inc. build 5367)]
 
 ./python.exe -m timeit min(range(255))
 
-100000 loops, best of 3: 16.1 usec per loop
+100000 loops, best of 3: 15.8 usec per loop
 
 ./python.exe -m timeit -s n=1000000 sum(range(n,n+10000))
 
-1000 loops, best of 3: 1.25 msec per loop
+1000 loops, best of 3: 1.22 msec per loop
 
 ./python.exe -m timeit sum(range(-32768,32768))
 
-100 loops, best of 3: 2.95 msec per loop
+100 loops, best of 3: 2.63 msec per loop
 
 ./python.exe -m timeit sum(range(-1000000,1000000))
 
-10 loops, best of 3: 283 msec per loop
+10 loops, best of 3: 273 msec per loop
 
 ./python.exe -m timeit sum(range(4000))
 
-1000 loops, best of 3: 177 usec per loop
+10000 loops, best of 3: 151 usec per loop
 
 ./python.exe -m timeit sum(list(range(4000)))
 
-1000 loops, best of 3: 276 usec per loop
+1000 loops, best of 3: 274 usec per loop
 
 ./python.exe -m timeit sum(range(10000))
 
-1000 loops, best of 3: 453 usec per loop
+1000 loops, best of 3: 382 usec per loop
 
 ./python.exe -m timeit sum(list(range(10000)))
 
-1000 loops, best of 3: 792 usec per loop
+1000 loops, best of 3: 802 usec per loop
 
 ./python.exe -m timeit sum(range(40000))
 
-100 loops, best of 3: 2.03 msec per loop
+100 loops, best of 3: 1.79 msec per loop
 
 ./python.exe -m timeit sum(range(80000))
 
-100 loops, best of 3: 6.01 msec per loop
+100 loops, best of 3: 5.77 msec per loop
msg64351 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-03-23 01:24
Looking at how much memory was actually used by PyLongObjects the way
our code allocates them... we can use this freelist for 2 digit PyLongs
on 32-bit systems and for 4 digit PyLongs on 64-bit systems.  Doing this
speeds things up even more as numbers > 32767 are still quite common.

why?  sizeof(PyVarObject) == 12, sizeof(digit) == 2.  Objects/obmalloc.c
allocates on 8, 16 and 32 byte boundaries.  Theres no such thing as a 14
byte allocation, its 16 byte aligned.  The same applies for 64-bit
platforms where the sizeof(PyVarObject) == 24, our allocation size is 32
 bytes there.

patch attached as -gps02.
msg64352 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-03-23 02:04
And the benchmarks of gps02... see the attached file.

At the end I had it run pybench rather than these microbenchmarks that
obviously show the speedup, the -gps02 patch ran 1.3% faster best case a
0.5% faster on average (32-bit; my mac's toolchain can't build enough of
py3k 64-bit to run pybench).

Things left for someone to do?  Determine what a good size for the
PyLong free list is.  4096 was a magic number chosen by tiran.
msg64373 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-03-23 19:05
The problem with choosing a sensible freelist size is that we don't have
any reference workloads. However, I just tested with 10000 and it
doesn't seem to slow anything down anyway. It doesn't make our
microbenchmarks

I thought the patch to compact freelists at each full gc collection had
been committed, but it doesn't seem there. Perhaps it will change
matters quite a bit. On the one hand, it will allow for bigger freelists
with less worries of degrading memory footprint (but still, potential
cache pollution). On the other hand, the bigger the freelists, the more
expensive it is to deallocate them.
msg69018 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-06-30 21:20
Gregory, your patch probably deserves checking in, it doesn't seem to me
there is any concern preventing you to do that.
msg71306 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-08-18 01:15
preventing this right now is that when i apply this to py3k today, it
fails miserably in a debug build.

first, my patch has an invalid assert(size > 0) in it in _PyLong_New as
0 size is used when the value is 0.  get rid of that line.

then things at least run but you'll end up in an infinite loop when the
interpreter exits at best if you've compiled in debug mode.

things work great in a non-pydebug build.

i believe the reason is this change is not properly looking at the
structure allocation sizes.  debug builds add extra structure fields.

i'm investigating.  the free_list code in floatobject.c does not have
this problem so at least there's a good example to go from.

and yet another reason why a more general free list library for various
internals to use would be useful...
msg71314 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-08-18 04:56
attached is an updated patch that also works in debug mode.  to do this
i had to exclude all zero length (value = 0) PyLongObjects from the free
list.  i'm still not sure why, they all have the same underlying
allocation size.

simplifying it to only freelist one digit longs the only useful
differences are in the microbenchmark of -32767..32768.  otherwise
things are slightly slower.  pybench is identical and pystone drops a bit.

i'm closing this as not worth it as things are written.  if someone
wants to pick up the idea and play with freelists go ahead but this
doesn't need to be an open tracker issue.  there is room for improvement
but these patches aren't it.
History
Date User Action Args
2008-08-18 04:56:57gregory.p.smithsetfiles: + py3k_longfreelist2_gps04.patch
2008-08-18 04:56:17gregory.p.smithsetstatus: open -> closed
resolution: rejected
messages: + msg71314
files: + py3k_longfreelist2_gps03.patch
2008-08-18 01:15:45gregory.p.smithsetassignee: gregory.p.smith
messages: + msg71306
2008-06-30 21:20:46pitrousetmessages: + msg69018
2008-03-23 19:05:33pitrousetmessages: + msg64373
2008-03-23 02:04:46gregory.p.smithsetfiles: + issue2013-benchmarks-gps02.txt
messages: + msg64352
2008-03-23 01:24:20gregory.p.smithsetfiles: + py3k_longfreelist2-gps02.patch
messages: + msg64351
2008-03-22 19:54:14gregory.p.smithsetfiles: + py3k_longfreelist2-gps01.patch
type: enhancement -> performance
messages: + msg64337
nosy: + gregory.p.smith
versions: - Python 2.6
2008-03-22 16:45:03pitrousetmessages: + msg64331
2008-02-12 22:48:19christian.heimessetmessages: + msg62342
2008-02-12 22:32:17pitrousetmessages: + msg62339
2008-02-12 22:19:05pitrousetfiles: + py3k_longfreelist2.patch
nosy: + pitrou
messages: + msg62338
2008-02-06 17:09:57christian.heimessetfiles: + py3k_longfreelist.patch
type: enhancement
messages: + msg62106
2008-02-05 03:07:46christian.heimescreate