This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: patch for making list/insert at the top of the list avoid memmoves
Type: performance
Stage: patch review
Components: Interpreter Core
Versions: Python 3.2, Python 2.7

process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To:
Nosy List: Rhamphoryncus, Steve Howell, flox, jcea, rhettinger, stutzbach
Priority: normal
Keywords: patch

Created on 2010-01-26 10:28 by Steve Howell, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
listobject_c_for_py3k.patch Steve Howell, 2010-01-26 10:28 listobject.c patch to give O(1) performance on list.pop(0)
issue7784_listobject_perf.diff flox, 2010-01-26 11:52 Patch, apply to trunk
no_mem_penalty.diff Steve Howell, 2010-01-29 10:25 fixed version of no-mem-penalty patch
Messages (11)
msg98324 - (view) Author: showell (Steve Howell) Date: 2010-01-26 10:28
I am attaching a patch to improve the performance of list operations that insert or delete elements at the front of the list.  The patch has had some discussion on python-dev in the thread entitled "patch to make list.pop(0) work in O(1) time."

So far the verdict is not to touch list internals to achieve this performance gain, but I am looking to improve the patch regardless and possibly include it in a PEP that documents the decision not to incorporate the patch, with the hope that it either prevents future duplication of effort or eventually gets accepted.

At a bare minimum, the patch needs extensive code review, as it touches a lot of internals, and it needs stylistic cleanup.  But it does pass all the tests, and it has been benchmarked to show 100X speed improvement on a small test case.
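
For context on the operation being optimized, the asymmetry on a stock interpreter can be seen by timing list.pop(0) against collections.deque.popleft(); the sketch below is illustrative only (the sizes and iteration counts are arbitrary, and it is not the benchmark behind the 100X figure):

    import timeit

    setup_list = 'lst = list(range(100000))'
    setup_deque = 'from collections import deque; d = deque(range(100000))'

    # pop(0) shifts every remaining element with a memmove, so each call is O(n)
    print(timeit.timeit('lst.pop(0); lst.append(0)', setup=setup_list, number=10000))
    # popleft() just adjusts the deque's head, so each call is O(1)
    print(timeit.timeit('d.popleft(); d.append(0)', setup=setup_deque, number=10000))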
msg98328 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-01-26 11:52
Steve, thank you for your patch.

IMHO, it's better to provide a patch against trunk (Py2) when the optimization is not specific to 3.x.

I merged it manually into my working copy, and all tests pass without leaking. Patch for Py2 attached.

However, I am not an expert. I just cleaned the tab/space mix to conform with what is used in the listobject.c file.
msg98329 - (view) Author: showell (Steve Howell) Date: 2010-01-26 11:58
Thank you!  My only rationale for applying the patch to 3.x was that it would minimize backward compatibility concerns, but I do not have a strong opinion either way.

You probably noticed a few stylistic mistakes, such as these lines:

            a->orphans += (-1 * d);
            a->ob_item += (-1 * d);

All feedback on ways to improve this code would be greatly welcome.

Thanks again.
msg98383 - (view) Author: showell (Steve Howell) Date: 2010-01-26 21:52
The next stage for the patch is to switch from using an orphans count to using an ob_allocated pointer.
msg98390 - (view) Author: Adam Olsen (Rhamphoryncus) Date: 2010-01-26 23:34
$ ./python -m timeit -s 'from collections import deque; c = deque(range(1000000))' 'c.append(c.popleft())'
1000000 loops, best of 3: 0.29 usec per loop

$ ./python -m timeit -s 'c = range(1000000)' 'c.append(c.pop(0))'
1000000 loops, best of 3: 0.424 usec per loop

Using flox's issue7784_listobject_perf.diff.  Significantly slower than the deque, but it does scale linearly.


$ ./python -m timeit -s 'c = range(1000000)' 'c.insert(0, c.pop())'
100 loops, best of 3: 3.39 msec per loop

Unfortunately inserting does not.  Will future patches attempt to address this?

Note that if it ends up slower than list and slower than deque, there isn't really a use case for it.
msg98391 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-01-27 00:01
Please take care in presenting timings.  It is easy to get misleading results:

* Factor out the attribute lookup time (since that isn't the part being changed):

  -s 'c = range(1000000); c_ins=c.insert; c_pop=c.pop'
  'c_ins(0, c_pop())'

* The case of alternately prepending and then popping the 0th element is atypical.  Try prepending many, then popping many.  Try popping first, then prepending (see the sketch below).

* The patch needs to be thoroughly exercised so that we cover all of the expected behaviors (do many prepends trigger memmoves, do many prepops leave tons of unused space, what happens with the normal case of a moving queue where the average size varies but hovers around a typical mean).

A decision cannot be based on exercising a small fragment of a data structure on just the things it is good at.  What happens to other typical list behavior and lookup times, etc?
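
For example, the "prepend many, then pop many" pattern mentioned above could be timed with a small script along these lines (the size is arbitrary; this is just a sketch of the shape of the workload, not a prescribed benchmark):

    import time

    n = 50000
    lst = []

    t = time.time()
    for i in range(n):       # prepend many ...
        lst.insert(0, i)
    for i in range(n):       # ... then pop many from the front
        lst.pop(0)
    print('time = ' + str(time.time() - t))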
msg98430 - (view) Author: showell (Steve Howell) Date: 2010-01-27 14:14
I will attempt to fix insert(0, pop) in the next patch I submit (using pointer vs. orphans count).

Just so we are clear on the performance that I'm promising, I expect benchmarks to prove out these facts:

  1) New-list will always perform comparably to old-list, particularly for access.
  2) New-list will outperform old-list for prepop.
  3) New-list will perform comparably to deque for prepop.
  4) Although not a super common use case, prepends that follow prepops should reclaim space and perform better than old list and comparably to deque.
  5) For prepends that exceed prepops, I expect deque to greatly outperform new lists, just as it did old lists.

As I explained on python-dev, the new patch should give lists the same performance profile as a pencil/paper todo list.  Prepop works in O(1) with occasional compacting.  Prepend generally works in O(N) unless you can reclaim space from prior prepops.
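
For illustration only, that profile resembles a pure-Python wrapper that tracks a head offset and compacts the buffer once the dead space in front grows too large; this is an analogue of the idea, not the C patch itself:

    class LazyFrontList:
        """Sketch: pop from the front without shifting elements on every call."""
        def __init__(self, items):
            self._items = list(items)
            self._head = 0                    # index of the logical first element

        def popleft(self):
            value = self._items[self._head]
            self._items[self._head] = None    # drop the reference
            self._head += 1
            # compact only when the dead space in front passes half the buffer
            if self._head * 2 > len(self._items):
                del self._items[:self._head]
                self._head = 0
            return value

        def __len__(self):
            return len(self._items) - self._head

Prepends could similarly reuse the dead slots in front instead of shifting, which is the behavior the patch aims for at the C level.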

To measure memory wastage, the two worst case scenarios are a bunch of tiny lists or a list from which you prepop just under 50% of the elements (although that 50% can be tuned later if that's the only showstopper).  You can certainly construct a benchmark that brings those negative aspects to light, but I think it would be more convincing to find an actual real-world program that performs worse or exhausts memory under the limitation.

In general, it would be nice to find a Python program that makes extensive use of lists to prove out that the new list implementation at least does not degrade performance.  To the extent that the test suite all passes, we at least have some empirical evidence that "correctness" has not been jeopardized, but the more we can hammer on this, the better, of course.  Does anybody have a suggestion for a real-world list benchmark program?
msg98431 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-01-27 14:19
For macro benchmarking, I suggest using the Unladen Swallow benchmark suite.
msg98501 - (view) Author: showell (Steve Howell) Date: 2010-01-29 10:05
I am attaching a new patch that does not add a new element to PyListObject, roughly following a technique that Antoine Pitrou suggested on python-dev.  When I want to lazily avoid a memmove under the new patch, I set the MSB on allocated and store the original ob_item pointer in the new ob_item[-1].

On the advice of Daniel, I ran the new patch against the Unladen benchmark suite.  The results were pretty neutral--never more than a 1% penalty but no significant gains either.  I did not expect to see gains, for the obvious reason that I am improving performance on an operation that folks have been encouraged to work around.  The new patch continues to do well on microbenchmarks. 

The new patch fails one test in test_sys related to the 12 byte garbage collection header.  The failure is definitely introduced by my patch, but I am not sure what it's doing wrong.  All other tests pass.

Because the new code piggybacks on top of allocated instead of creating a new variable in PyListObject, the new code is a bit more complex than the original patch, which is unfortunate.  There are probably some opportunities for making the new code simpler.
msg98502 - (view) Author: showell (Steve Howell) Date: 2010-01-29 10:25
Ok, found the offending line, now all tests pass.  

The use case where this patch will pay off the most is slicing your way through a list of tasks.  The toy program below gets about a 50x speedup.

	import time

	n = 800000

	lst = []
	for i in range(n):
		lst.append(i)

	t = time.time()
	for i in range(n):
		x = lst[:10]
		del lst[:10]

	print('time = ' + str(time.time() - t))
	print(len(lst))
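
For comparison, the same task-slicing workload written against collections.deque would look roughly like the following (illustrative only; a deque does not support slicing, so the front chunk is taken with repeated popleft calls):

    import time
    from collections import deque

    n = 800000

    tasks = deque(range(n))

    t = time.time()
    while tasks:
        # take up to 10 tasks off the front, analogous to lst[:10] / del lst[:10]
        chunk = [tasks.popleft() for _ in range(min(10, len(tasks)))]

    print('time = ' + str(time.time() - t))
    print(len(tasks))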
msg98532 - (view) Author: showell (Steve Howell) Date: 2010-01-29 19:31
I am closing this due to mostly unanimous rejection on python-dev.

If folks reopen this in the future, the last patch that I submitted has been reasonably well tested, but it has not been code reviewed.

The 1% speed penalty could probably be driven down without too much effort, but not totally eliminated.

The constraint not to add a member to PyListObject complicates the code considerably.

In diving deep into listobject.c, I noticed a couple things unrelated to my patch:

  1) There might be a refactoring available for calls to list_resize, where callers just pass in the delta instead of the new list size.  Lots of the callers do addition that could be pushed into list_resize.  Not sure it would lead to speedups, but it would probably reduce code size.

  2) The optimistic realloc scheme is probably needlessly wasteful for really large lists.  If you have a million elements, I am not sure you need to allocate 12% extra space, and I think that's what the current algorithm does (untested; see the sketch below).

  3) There might be some merit in splitting out list_resize into list_resize_bigger and list_resize_smaller.  I think the callers generally know when they are shrinking/expanding the list, so callers that are shrinking the list could call a leaner method that just shrinks the list without having to execute other instructions.
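
Regarding point 2, a quick way to eyeball the over-allocation is to mirror list_resize's growth formula in a couple of lines of Python (the constants below follow the formula in listobject.c of that era; treat the exact figures as an approximation):

    def list_overallocation(newsize):
        # roughly newsize/8 extra slots, plus a small constant
        return (newsize >> 3) + (3 if newsize < 9 else 6)

    n = 1000000
    extra = list_overallocation(n)
    print('%d extra slots, about %.1f%%' % (extra, 100.0 * extra / n))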
History
Date User Action Args
2022-04-11 14:56:56 admin set github: 52032
2010-03-24 18:10:34 rhettinger set resolution: rejected
2010-03-24 16:56:41 jcea set nosy: + jcea
2010-01-29 19:31:03 Steve Howell set status: open -> closed; messages: + msg98532
2010-01-29 10:26:21 Steve Howell set files: - list_top_no_extra_mem.diff
2010-01-29 10:25:57 Steve Howell set files: + no_mem_penalty.diff; messages: + msg98502
2010-01-29 10:05:04 Steve Howell set files: + list_top_no_extra_mem.diff; messages: + msg98501
2010-01-27 14:19:19 stutzbach set messages: + msg98431
2010-01-27 14:14:11 Steve Howell set messages: + msg98430
2010-01-27 00:01:44 rhettinger set nosy: + rhettinger; messages: + msg98391
2010-01-26 23:34:07 Rhamphoryncus set nosy: + Rhamphoryncus; messages: + msg98390
2010-01-26 21:52:46 Steve Howell set messages: + msg98383
2010-01-26 21:11:57 stutzbach set nosy: + stutzbach
2010-01-26 11:58:25 Steve Howell set messages: + msg98329
2010-01-26 11:53:00 flox set files: + issue7784_listobject_perf.diff; versions: + Python 2.7; keywords: + patch; nosy: + flox; messages: + msg98328; stage: patch review
2010-01-26 10:28:03 Steve Howell create