Long object free list optimization #46297

tiran · 2008-02-05T03:07:47Z

BPO	2013
Nosy	@gpshead, @pitrou, @tiran
Files	long_freelist.patch py3k_longfreelist.patch py3k_longfreelist2.patch py3k_longfreelist2-gps01.patch: improvement over longfreelist2 py3k_longfreelist2-gps02.patch issue2013-benchmarks-gps02.txt: benchmarks of the -gps02 patch py3k_longfreelist2_gps03.patch: updated, now works in debug mode. py3k_longfreelist2_gps04.patch: resimplified for only 1 digit longs

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/gpshead'
closed_at = <Date 2008-08-18.04:56:17.247>
created_at = <Date 2008-02-05.03:07:46.663>
labels = ['interpreter-core', 'performance']
title = 'Long object free list optimization'
updated_at = <Date 2008-08-18.04:56:57.079>
user = 'https://github.com/tiran'

bugs.python.org fields:

activity = <Date 2008-08-18.04:56:57.079>
actor = 'gregory.p.smith'
assignee = 'gregory.p.smith'
closed = True
closed_date = <Date 2008-08-18.04:56:17.247>
closer = 'gregory.p.smith'
components = ['Interpreter Core']
creation = <Date 2008-02-05.03:07:46.663>
creator = 'christian.heimes'
dependencies = []
files = ['9355', '9361', '9421', '9817', '9824', '9825', '11143', '11144']
hgrepos = []
issue_num = 2013
keywords = ['patch']
message_count = 13.0
messages = ['62060', '62106', '62338', '62339', '62342', '64331', '64337', '64351', '64352', '64373', '69018', '71306', '71314']
nosy_count = 3.0
nosy_names = ['gregory.p.smith', 'pitrou', 'christian.heimes']
pr_nums = []
priority = 'normal'
resolution = 'rejected'
stage = None
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue2013'
versions = ['Python 3.0']

tiran · 2008-02-05T03:07:45Z

The patch adds a free list of long objects with size 1 or -1. It has
quite a tremendous effect on code like list(range(1000)).

$ ./python -m timeit "for i in range(100): list(range(1000))"

Without patch:
10 loops, best of 3: 79 msec per loop

With patch:
10 loops, best of 3: 20.8 msec per loop

tiran · 2008-02-06T17:09:57Z

The updated patch limits the free list.

pitrou · 2008-02-12T22:19:05Z

I don't get the same impressive speedup as you do, although it's still
significant.

$ ./python -m timeit "for i in range(100): list(range(1000))"
Without patch:
100 loops, best of 3: 5.05 msec per loop
With patch:
100 loops, best of 3: 3.57 msec per loop

Also, your patch is leaky. I'm attaching a fixed version (for py3k,
didn't check the trunk version).

pitrou · 2008-02-12T22:32:17Z

Christian, maybe you did your measurements with the DEBUG flag turned
on? That would explain the discrepancy.

Also, I'm not sure the patch is useful for 2.x since long objects with
size -1 or 1 should be represented as ints instead :-)

tiran · 2008-02-12T22:48:20Z

Antoine Pitrou wrote:

Christian, maybe you did your measurements with the DEBUG flag turned
on? That would explain the discrepancy.

Also, I'm not sure the patch is useful for 2.x since long objects with
size -1 or 1 should be represented as ints instead :-)

Yes, I've used a Py_Debug build to measure the speed difference.

You are right. The patch makes no sense for the 2.x series.

Christian

pitrou · 2008-03-22T16:45:03Z

Here are some updated timings against the current py3k branch:

$ ./python -m timeit "sum(range(10000))"

Without patch:
1000 loops, best of 3: 675 usec per loop
With patch:
1000 loops, best of 3: 462 usec per loop

$ ./python -m timeit -s "n=1000000" "sum(range(n, n+10000))"

Without patch:
1000 loops, best of 3: 1.36 msec per loop
With patch:
1000 loops, best of 3: 1.38 msec per loop

$ ./python -m timeit "min(range(255))"

Without patch:
10000 loops, best of 3: 18.7 usec per loop
With patch:
10000 loops, best of 3: 19.4 usec per loop

As you can see the patch makes things quite a bit better for 1-digit
long objects, and there is only a small slowdown for longer or tiny
ints. Given that 1-digit long objects should be prevalent in most code I
think it's probably a winner overall.

gpshead · 2008-03-22T19:54:13Z

I'm attaching a slightly improved version of the longfreelist2 patch.
It moves the important test to be first and removes the unneeded ? :
conditional.

........................
Here are some more benchmarks:
........................

--- issue2013-without-patch.txt	2008-03-22 12:18:37.000000000 -0700
+++ issue2013-longfreelist2-gps01-size-first.txt	2008-03-22
12:34:44.000000000 -0700
@@ -1,43 +1,43 @@
-3.0a3+ (py3k:61746M, Mar 22 2008, 12:12:29) 
+3.0a3+ (py3k:61746M, Mar 22 2008, 12:33:47) 
 [GCC 4.0.1 (Apple Computer, Inc. build 5367)]
 
 ./python.exe -m timeit min(range(255))
 
-100000 loops, best of 3: 17.2 usec per loop
+100000 loops, best of 3: 15.8 usec per loop
 
 ./python.exe -m timeit -s n=1000000 sum(range(n,n+10000))
 
-1000 loops, best of 3: 1.21 msec per loop
+1000 loops, best of 3: 1.22 msec per loop
 
 ./python.exe -m timeit sum(range(-32768,32768))
 
-100 loops, best of 3: 3.98 msec per loop
+100 loops, best of 3: 2.63 msec per loop
 
 ./python.exe -m timeit sum(range(-1000000,1000000))
 
-10 loops, best of 3: 296 msec per loop
+10 loops, best of 3: 273 msec per loop
 
 ./python.exe -m timeit sum(range(4000))
 
-1000 loops, best of 3: 239 usec per loop
+10000 loops, best of 3: 151 usec per loop
 
 ./python.exe -m timeit sum(list(range(4000)))
 
-1000 loops, best of 3: 360 usec per loop
+1000 loops, best of 3: 274 usec per loop
 
 ./python.exe -m timeit sum(range(10000))
 
-1000 loops, best of 3: 615 usec per loop
+1000 loops, best of 3: 382 usec per loop
 
 ./python.exe -m timeit sum(list(range(10000)))
 
-1000 loops, best of 3: 887 usec per loop
+1000 loops, best of 3: 802 usec per loop
 
 ./python.exe -m timeit sum(range(40000))
 
-100 loops, best of 3: 2.55 msec per loop
+100 loops, best of 3: 1.79 msec per loop
 
 ./python.exe -m timeit sum(range(80000))
 
-100 loops, best of 3: 6.4 msec per loop
+100 loops, best of 3: 5.77 msec per loop

........................
And a benchmarks of the longfreelist2 patch before my improvement:
........................

--- issue2013-without-patch.txt	2008-03-22 12:18:37.000000000 -0700
+++ issue2013-longfreelist2.txt	2008-03-22 12:19:46.000000000 -0700
@@ -1,43 +1,43 @@
-3.0a3+ (py3k:61746M, Mar 22 2008, 12:12:29) 
+3.0a3+ (py3k:61746M, Mar 22 2008, 12:18:57) 
 [GCC 4.0.1 (Apple Computer, Inc. build 5367)]
 
 ./python.exe -m timeit min(range(255))
 
-100000 loops, best of 3: 17.2 usec per loop
+100000 loops, best of 3: 16.1 usec per loop
 
 ./python.exe -m timeit -s n=1000000 sum(range(n,n+10000))
 
-1000 loops, best of 3: 1.21 msec per loop
+1000 loops, best of 3: 1.25 msec per loop
 
 ./python.exe -m timeit sum(range(-32768,32768))
 
-100 loops, best of 3: 3.98 msec per loop
+100 loops, best of 3: 2.95 msec per loop
 
 ./python.exe -m timeit sum(range(-1000000,1000000))
 
-10 loops, best of 3: 296 msec per loop
+10 loops, best of 3: 283 msec per loop
 
 ./python.exe -m timeit sum(range(4000))
 
-1000 loops, best of 3: 239 usec per loop
+1000 loops, best of 3: 177 usec per loop
 
 ./python.exe -m timeit sum(list(range(4000)))
 
-1000 loops, best of 3: 360 usec per loop
+1000 loops, best of 3: 276 usec per loop
 
 ./python.exe -m timeit sum(range(10000))
 
-1000 loops, best of 3: 615 usec per loop
+1000 loops, best of 3: 453 usec per loop
 
 ./python.exe -m timeit sum(list(range(10000)))
 
-1000 loops, best of 3: 887 usec per loop
+1000 loops, best of 3: 792 usec per loop
 
 ./python.exe -m timeit sum(range(40000))
 
-100 loops, best of 3: 2.55 msec per loop
+100 loops, best of 3: 2.03 msec per loop
 
 ./python.exe -m timeit sum(range(80000))
 
-100 loops, best of 3: 6.4 msec per loop
+100 loops, best of 3: 6.01 msec per loop
 

........................
And diffs of longfreelist2 vs longfreelist2-gps01:
........................
--- issue2013-longfreelist2.txt	2008-03-22 12:19:46.000000000 -0700
+++ issue2013-longfreelist2-gps01-size-first.txt	2008-03-22
12:34:44.000000000 -0700
@@ -1,43 +1,43 @@
-3.0a3+ (py3k:61746M, Mar 22 2008, 12:18:57) 
+3.0a3+ (py3k:61746M, Mar 22 2008, 12:33:47) 
 [GCC 4.0.1 (Apple Computer, Inc. build 5367)]
 
 ./python.exe -m timeit min(range(255))
 
-100000 loops, best of 3: 16.1 usec per loop
+100000 loops, best of 3: 15.8 usec per loop
 
 ./python.exe -m timeit -s n=1000000 sum(range(n,n+10000))
 
-1000 loops, best of 3: 1.25 msec per loop
+1000 loops, best of 3: 1.22 msec per loop
 
 ./python.exe -m timeit sum(range(-32768,32768))
 
-100 loops, best of 3: 2.95 msec per loop
+100 loops, best of 3: 2.63 msec per loop
 
 ./python.exe -m timeit sum(range(-1000000,1000000))
 
-10 loops, best of 3: 283 msec per loop
+10 loops, best of 3: 273 msec per loop
 
 ./python.exe -m timeit sum(range(4000))
 
-1000 loops, best of 3: 177 usec per loop
+10000 loops, best of 3: 151 usec per loop
 
 ./python.exe -m timeit sum(list(range(4000)))
 
-1000 loops, best of 3: 276 usec per loop
+1000 loops, best of 3: 274 usec per loop
 
 ./python.exe -m timeit sum(range(10000))
 
-1000 loops, best of 3: 453 usec per loop
+1000 loops, best of 3: 382 usec per loop
 
 ./python.exe -m timeit sum(list(range(10000)))
 
-1000 loops, best of 3: 792 usec per loop
+1000 loops, best of 3: 802 usec per loop
 
 ./python.exe -m timeit sum(range(40000))
 
-100 loops, best of 3: 2.03 msec per loop
+100 loops, best of 3: 1.79 msec per loop
 
 ./python.exe -m timeit sum(range(80000))
 
-100 loops, best of 3: 6.01 msec per loop
+100 loops, best of 3: 5.77 msec per loop

gpshead · 2008-03-23T01:24:17Z

Looking at how much memory was actually used by PyLongObjects the way
our code allocates them... we can use this freelist for 2 digit PyLongs
on 32-bit systems and for 4 digit PyLongs on 64-bit systems. Doing this
speeds things up even more as numbers > 32767 are still quite common.

why? sizeof(PyVarObject) == 12, sizeof(digit) == 2. Objects/obmalloc.c
allocates on 8, 16 and 32 byte boundaries. Theres no such thing as a 14
byte allocation, its 16 byte aligned. The same applies for 64-bit
platforms where the sizeof(PyVarObject) == 24, our allocation size is 32
bytes there.

patch attached as -gps02.

gpshead · 2008-03-23T02:04:43Z

And the benchmarks of gps02... see the attached file.

At the end I had it run pybench rather than these microbenchmarks that
obviously show the speedup, the -gps02 patch ran 1.3% faster best case a
0.5% faster on average (32-bit; my mac's toolchain can't build enough of
py3k 64-bit to run pybench).

Things left for someone to do? Determine what a good size for the
PyLong free list is. 4096 was a magic number chosen by tiran.

pitrou · 2008-03-23T19:05:32Z

The problem with choosing a sensible freelist size is that we don't have
any reference workloads. However, I just tested with 10000 and it
doesn't seem to slow anything down anyway. It doesn't make our
microbenchmarks

I thought the patch to compact freelists at each full gc collection had
been committed, but it doesn't seem there. Perhaps it will change
matters quite a bit. On the one hand, it will allow for bigger freelists
with less worries of degrading memory footprint (but still, potential
cache pollution). On the other hand, the bigger the freelists, the more
expensive it is to deallocate them.

pitrou · 2008-06-30T21:20:46Z

Gregory, your patch probably deserves checking in, it doesn't seem to me
there is any concern preventing you to do that.

gpshead · 2008-08-18T01:15:44Z

preventing this right now is that when i apply this to py3k today, it
fails miserably in a debug build.

first, my patch has an invalid assert(size > 0) in it in _PyLong_New as
0 size is used when the value is 0. get rid of that line.

then things at least run but you'll end up in an infinite loop when the
interpreter exits at best if you've compiled in debug mode.

things work great in a non-pydebug build.

i believe the reason is this change is not properly looking at the
structure allocation sizes. debug builds add extra structure fields.

i'm investigating. the free_list code in floatobject.c does not have
this problem so at least there's a good example to go from.

and yet another reason why a more general free list library for various
internals to use would be useful...

gpshead · 2008-08-18T04:56:11Z

attached is an updated patch that also works in debug mode. to do this
i had to exclude all zero length (value = 0) PyLongObjects from the free
list. i'm still not sure why, they all have the same underlying
allocation size.

simplifying it to only freelist one digit longs the only useful
differences are in the microbenchmark of -32767..32768. otherwise
things are slightly slower. pybench is identical and pystone drops a bit.

i'm closing this as not worth it as things are written. if someone
wants to pick up the idea and play with freelists go ahead but this
doesn't need to be an open tracker issue. there is room for improvement
but these patches aren't it.

tiran added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Feb 5, 2008

tiran added the type-feature A feature request or enhancement label Feb 6, 2008

gpshead added performance Performance or resource usage and removed type-feature A feature request or enhancement labels Mar 22, 2008

gpshead self-assigned this Aug 18, 2008

gpshead closed this as completed Aug 18, 2008

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Long object free list optimization #46297

Long object free list optimization #46297

tiran commented Feb 5, 2008

tiran commented Feb 5, 2008

tiran commented Feb 6, 2008

pitrou commented Feb 12, 2008

pitrou commented Feb 12, 2008

tiran commented Feb 12, 2008

pitrou commented Mar 22, 2008

gpshead commented Mar 22, 2008

gpshead commented Mar 23, 2008

gpshead commented Mar 23, 2008

pitrou commented Mar 23, 2008

pitrou commented Jun 30, 2008

gpshead commented Aug 18, 2008

gpshead commented Aug 18, 2008

Long object free list optimization #46297

Long object free list optimization #46297

Comments

tiran commented Feb 5, 2008

tiran commented Feb 5, 2008

tiran commented Feb 6, 2008

pitrou commented Feb 12, 2008

pitrou commented Feb 12, 2008

tiran commented Feb 12, 2008

pitrou commented Mar 22, 2008

gpshead commented Mar 22, 2008

gpshead commented Mar 23, 2008

gpshead commented Mar 23, 2008

pitrou commented Mar 23, 2008

pitrou commented Jun 30, 2008

gpshead commented Aug 18, 2008

gpshead commented Aug 18, 2008