Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long object free list optimization #46297

Closed
tiran opened this issue Feb 5, 2008 · 13 comments
Closed

Long object free list optimization #46297

tiran opened this issue Feb 5, 2008 · 13 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@tiran
Copy link
Member

tiran commented Feb 5, 2008

BPO 2013
Nosy @gpshead, @pitrou, @tiran
Files
  • long_freelist.patch
  • py3k_longfreelist.patch
  • py3k_longfreelist2.patch
  • py3k_longfreelist2-gps01.patch: improvement over longfreelist2
  • py3k_longfreelist2-gps02.patch
  • issue2013-benchmarks-gps02.txt: benchmarks of the -gps02 patch
  • py3k_longfreelist2_gps03.patch: updated, now works in debug mode.
  • py3k_longfreelist2_gps04.patch: resimplified for only 1 digit longs
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/gpshead'
    closed_at = <Date 2008-08-18.04:56:17.247>
    created_at = <Date 2008-02-05.03:07:46.663>
    labels = ['interpreter-core', 'performance']
    title = 'Long object free list optimization'
    updated_at = <Date 2008-08-18.04:56:57.079>
    user = 'https://github.com/tiran'

    bugs.python.org fields:

    activity = <Date 2008-08-18.04:56:57.079>
    actor = 'gregory.p.smith'
    assignee = 'gregory.p.smith'
    closed = True
    closed_date = <Date 2008-08-18.04:56:17.247>
    closer = 'gregory.p.smith'
    components = ['Interpreter Core']
    creation = <Date 2008-02-05.03:07:46.663>
    creator = 'christian.heimes'
    dependencies = []
    files = ['9355', '9361', '9421', '9817', '9824', '9825', '11143', '11144']
    hgrepos = []
    issue_num = 2013
    keywords = ['patch']
    message_count = 13.0
    messages = ['62060', '62106', '62338', '62339', '62342', '64331', '64337', '64351', '64352', '64373', '69018', '71306', '71314']
    nosy_count = 3.0
    nosy_names = ['gregory.p.smith', 'pitrou', 'christian.heimes']
    pr_nums = []
    priority = 'normal'
    resolution = 'rejected'
    stage = None
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue2013'
    versions = ['Python 3.0']

    @tiran
    Copy link
    Member Author

    tiran commented Feb 5, 2008

    The patch adds a free list of long objects with size 1 or -1. It has
    quite a tremendous effect on code like list(range(1000)).

    $ ./python -m timeit "for i in range(100): list(range(1000))"

    Without patch:
    10 loops, best of 3: 79 msec per loop

    With patch:
    10 loops, best of 3: 20.8 msec per loop

    @tiran tiran added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Feb 5, 2008
    @tiran
    Copy link
    Member Author

    tiran commented Feb 6, 2008

    The updated patch limits the free list.

    @tiran tiran added the type-feature A feature request or enhancement label Feb 6, 2008
    @pitrou
    Copy link
    Member

    pitrou commented Feb 12, 2008

    I don't get the same impressive speedup as you do, although it's still
    significant.

    $ ./python -m timeit "for i in range(100): list(range(1000))"
    Without patch:
    100 loops, best of 3: 5.05 msec per loop
    With patch:
    100 loops, best of 3: 3.57 msec per loop

    Also, your patch is leaky. I'm attaching a fixed version (for py3k,
    didn't check the trunk version).

    @pitrou
    Copy link
    Member

    pitrou commented Feb 12, 2008

    Christian, maybe you did your measurements with the DEBUG flag turned
    on? That would explain the discrepancy.

    Also, I'm not sure the patch is useful for 2.x since long objects with
    size -1 or 1 should be represented as ints instead :-)

    @tiran
    Copy link
    Member Author

    tiran commented Feb 12, 2008

    Antoine Pitrou wrote:

    Christian, maybe you did your measurements with the DEBUG flag turned
    on? That would explain the discrepancy.

    Also, I'm not sure the patch is useful for 2.x since long objects with
    size -1 or 1 should be represented as ints instead :-)

    Yes, I've used a Py_Debug build to measure the speed difference.

    You are right. The patch makes no sense for the 2.x series.

    Christian

    @pitrou
    Copy link
    Member

    pitrou commented Mar 22, 2008

    Here are some updated timings against the current py3k branch:

    $ ./python -m timeit "sum(range(10000))"

    Without patch:
    1000 loops, best of 3: 675 usec per loop
    With patch:
    1000 loops, best of 3: 462 usec per loop

    $ ./python -m timeit -s "n=1000000" "sum(range(n, n+10000))"

    Without patch:
    1000 loops, best of 3: 1.36 msec per loop
    With patch:
    1000 loops, best of 3: 1.38 msec per loop

    $ ./python -m timeit "min(range(255))"

    Without patch:
    10000 loops, best of 3: 18.7 usec per loop
    With patch:
    10000 loops, best of 3: 19.4 usec per loop

    As you can see the patch makes things quite a bit better for 1-digit
    long objects, and there is only a small slowdown for longer or tiny
    ints. Given that 1-digit long objects should be prevalent in most code I
    think it's probably a winner overall.

    @gpshead
    Copy link
    Member

    gpshead commented Mar 22, 2008

    I'm attaching a slightly improved version of the longfreelist2 patch.
    It moves the important test to be first and removes the unneeded ? :
    conditional.

    ........................
    Here are some more benchmarks:
    ........................

    --- issue2013-without-patch.txt	2008-03-22 12:18:37.000000000 -0700
    +++ issue2013-longfreelist2-gps01-size-first.txt	2008-03-22
    12:34:44.000000000 -0700
    @@ -1,43 +1,43 @@
    -3.0a3+ (py3k:61746M, Mar 22 2008, 12:12:29) 
    +3.0a3+ (py3k:61746M, Mar 22 2008, 12:33:47) 
     [GCC 4.0.1 (Apple Computer, Inc. build 5367)]
     
     ./python.exe -m timeit min(range(255))
     
    -100000 loops, best of 3: 17.2 usec per loop
    +100000 loops, best of 3: 15.8 usec per loop
     
     ./python.exe -m timeit -s n=1000000 sum(range(n,n+10000))
     
    -1000 loops, best of 3: 1.21 msec per loop
    +1000 loops, best of 3: 1.22 msec per loop
     
     ./python.exe -m timeit sum(range(-32768,32768))
     
    -100 loops, best of 3: 3.98 msec per loop
    +100 loops, best of 3: 2.63 msec per loop
     
     ./python.exe -m timeit sum(range(-1000000,1000000))
     
    -10 loops, best of 3: 296 msec per loop
    +10 loops, best of 3: 273 msec per loop
     
     ./python.exe -m timeit sum(range(4000))
     
    -1000 loops, best of 3: 239 usec per loop
    +10000 loops, best of 3: 151 usec per loop
     
     ./python.exe -m timeit sum(list(range(4000)))
     
    -1000 loops, best of 3: 360 usec per loop
    +1000 loops, best of 3: 274 usec per loop
     
     ./python.exe -m timeit sum(range(10000))
     
    -1000 loops, best of 3: 615 usec per loop
    +1000 loops, best of 3: 382 usec per loop
     
     ./python.exe -m timeit sum(list(range(10000)))
     
    -1000 loops, best of 3: 887 usec per loop
    +1000 loops, best of 3: 802 usec per loop
     
     ./python.exe -m timeit sum(range(40000))
     
    -100 loops, best of 3: 2.55 msec per loop
    +100 loops, best of 3: 1.79 msec per loop
     
     ./python.exe -m timeit sum(range(80000))
     
    -100 loops, best of 3: 6.4 msec per loop
    +100 loops, best of 3: 5.77 msec per loop
     

    ........................
    And a benchmarks of the longfreelist2 patch before my improvement:
    ........................

    --- issue2013-without-patch.txt	2008-03-22 12:18:37.000000000 -0700
    +++ issue2013-longfreelist2.txt	2008-03-22 12:19:46.000000000 -0700
    @@ -1,43 +1,43 @@
    -3.0a3+ (py3k:61746M, Mar 22 2008, 12:12:29) 
    +3.0a3+ (py3k:61746M, Mar 22 2008, 12:18:57) 
     [GCC 4.0.1 (Apple Computer, Inc. build 5367)]
     
     ./python.exe -m timeit min(range(255))
     
    -100000 loops, best of 3: 17.2 usec per loop
    +100000 loops, best of 3: 16.1 usec per loop
     
     ./python.exe -m timeit -s n=1000000 sum(range(n,n+10000))
     
    -1000 loops, best of 3: 1.21 msec per loop
    +1000 loops, best of 3: 1.25 msec per loop
     
     ./python.exe -m timeit sum(range(-32768,32768))
     
    -100 loops, best of 3: 3.98 msec per loop
    +100 loops, best of 3: 2.95 msec per loop
     
     ./python.exe -m timeit sum(range(-1000000,1000000))
     
    -10 loops, best of 3: 296 msec per loop
    +10 loops, best of 3: 283 msec per loop
     
     ./python.exe -m timeit sum(range(4000))
     
    -1000 loops, best of 3: 239 usec per loop
    +1000 loops, best of 3: 177 usec per loop
     
     ./python.exe -m timeit sum(list(range(4000)))
     
    -1000 loops, best of 3: 360 usec per loop
    +1000 loops, best of 3: 276 usec per loop
     
     ./python.exe -m timeit sum(range(10000))
     
    -1000 loops, best of 3: 615 usec per loop
    +1000 loops, best of 3: 453 usec per loop
     
     ./python.exe -m timeit sum(list(range(10000)))
     
    -1000 loops, best of 3: 887 usec per loop
    +1000 loops, best of 3: 792 usec per loop
     
     ./python.exe -m timeit sum(range(40000))
     
    -100 loops, best of 3: 2.55 msec per loop
    +100 loops, best of 3: 2.03 msec per loop
     
     ./python.exe -m timeit sum(range(80000))
     
    -100 loops, best of 3: 6.4 msec per loop
    +100 loops, best of 3: 6.01 msec per loop
     
    
    ........................
    And diffs of longfreelist2 vs longfreelist2-gps01:
    ........................
    --- issue2013-longfreelist2.txt	2008-03-22 12:19:46.000000000 -0700
    +++ issue2013-longfreelist2-gps01-size-first.txt	2008-03-22
    12:34:44.000000000 -0700
    @@ -1,43 +1,43 @@
    -3.0a3+ (py3k:61746M, Mar 22 2008, 12:18:57) 
    +3.0a3+ (py3k:61746M, Mar 22 2008, 12:33:47) 
     [GCC 4.0.1 (Apple Computer, Inc. build 5367)]
     
     ./python.exe -m timeit min(range(255))
     
    -100000 loops, best of 3: 16.1 usec per loop
    +100000 loops, best of 3: 15.8 usec per loop
     
     ./python.exe -m timeit -s n=1000000 sum(range(n,n+10000))
     
    -1000 loops, best of 3: 1.25 msec per loop
    +1000 loops, best of 3: 1.22 msec per loop
     
     ./python.exe -m timeit sum(range(-32768,32768))
     
    -100 loops, best of 3: 2.95 msec per loop
    +100 loops, best of 3: 2.63 msec per loop
     
     ./python.exe -m timeit sum(range(-1000000,1000000))
     
    -10 loops, best of 3: 283 msec per loop
    +10 loops, best of 3: 273 msec per loop
     
     ./python.exe -m timeit sum(range(4000))
     
    -1000 loops, best of 3: 177 usec per loop
    +10000 loops, best of 3: 151 usec per loop
     
     ./python.exe -m timeit sum(list(range(4000)))
     
    -1000 loops, best of 3: 276 usec per loop
    +1000 loops, best of 3: 274 usec per loop
     
     ./python.exe -m timeit sum(range(10000))
     
    -1000 loops, best of 3: 453 usec per loop
    +1000 loops, best of 3: 382 usec per loop
     
     ./python.exe -m timeit sum(list(range(10000)))
     
    -1000 loops, best of 3: 792 usec per loop
    +1000 loops, best of 3: 802 usec per loop
     
     ./python.exe -m timeit sum(range(40000))
     
    -100 loops, best of 3: 2.03 msec per loop
    +100 loops, best of 3: 1.79 msec per loop
     
     ./python.exe -m timeit sum(range(80000))
     
    -100 loops, best of 3: 6.01 msec per loop
    +100 loops, best of 3: 5.77 msec per loop

    @gpshead gpshead added performance Performance or resource usage and removed type-feature A feature request or enhancement labels Mar 22, 2008
    @gpshead
    Copy link
    Member

    gpshead commented Mar 23, 2008

    Looking at how much memory was actually used by PyLongObjects the way
    our code allocates them... we can use this freelist for 2 digit PyLongs
    on 32-bit systems and for 4 digit PyLongs on 64-bit systems. Doing this
    speeds things up even more as numbers > 32767 are still quite common.

    why? sizeof(PyVarObject) == 12, sizeof(digit) == 2. Objects/obmalloc.c
    allocates on 8, 16 and 32 byte boundaries. Theres no such thing as a 14
    byte allocation, its 16 byte aligned. The same applies for 64-bit
    platforms where the sizeof(PyVarObject) == 24, our allocation size is 32
    bytes there.

    patch attached as -gps02.

    @gpshead
    Copy link
    Member

    gpshead commented Mar 23, 2008

    And the benchmarks of gps02... see the attached file.

    At the end I had it run pybench rather than these microbenchmarks that
    obviously show the speedup, the -gps02 patch ran 1.3% faster best case a
    0.5% faster on average (32-bit; my mac's toolchain can't build enough of
    py3k 64-bit to run pybench).

    Things left for someone to do? Determine what a good size for the
    PyLong free list is. 4096 was a magic number chosen by tiran.

    @pitrou
    Copy link
    Member

    pitrou commented Mar 23, 2008

    The problem with choosing a sensible freelist size is that we don't have
    any reference workloads. However, I just tested with 10000 and it
    doesn't seem to slow anything down anyway. It doesn't make our
    microbenchmarks

    I thought the patch to compact freelists at each full gc collection had
    been committed, but it doesn't seem there. Perhaps it will change
    matters quite a bit. On the one hand, it will allow for bigger freelists
    with less worries of degrading memory footprint (but still, potential
    cache pollution). On the other hand, the bigger the freelists, the more
    expensive it is to deallocate them.

    @pitrou
    Copy link
    Member

    pitrou commented Jun 30, 2008

    Gregory, your patch probably deserves checking in, it doesn't seem to me
    there is any concern preventing you to do that.

    @gpshead
    Copy link
    Member

    gpshead commented Aug 18, 2008

    preventing this right now is that when i apply this to py3k today, it
    fails miserably in a debug build.

    first, my patch has an invalid assert(size > 0) in it in _PyLong_New as
    0 size is used when the value is 0. get rid of that line.

    then things at least run but you'll end up in an infinite loop when the
    interpreter exits at best if you've compiled in debug mode.

    things work great in a non-pydebug build.

    i believe the reason is this change is not properly looking at the
    structure allocation sizes. debug builds add extra structure fields.

    i'm investigating. the free_list code in floatobject.c does not have
    this problem so at least there's a good example to go from.

    and yet another reason why a more general free list library for various
    internals to use would be useful...

    @gpshead gpshead self-assigned this Aug 18, 2008
    @gpshead
    Copy link
    Member

    gpshead commented Aug 18, 2008

    attached is an updated patch that also works in debug mode. to do this
    i had to exclude all zero length (value = 0) PyLongObjects from the free
    list. i'm still not sure why, they all have the same underlying
    allocation size.

    simplifying it to only freelist one digit longs the only useful
    differences are in the microbenchmark of -32767..32768. otherwise
    things are slightly slower. pybench is identical and pystone drops a bit.

    i'm closing this as not worth it as things are written. if someone
    wants to pick up the idea and play with freelists go ahead but this
    doesn't need to be an open tracker issue. there is room for improvement
    but these patches aren't it.

    @gpshead gpshead closed this as completed Aug 18, 2008
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants