This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: Speed up range() by caching and modifying long objects
Type: performance
Stage: resolved
Components: Interpreter Core
Versions: Python 3.5

process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To: larry
Nosy List: BreamoreBoy, ethan.furman, gvanrossum, larry, lemburg, mark.dickinson, pitrou, rhettinger, scoder, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2015-05-06 16:18 by larry, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name               Uploaded
larry.range.hack.1.txt  larry, 2015-05-06 16:18
int_free_list.patch     serhiy.storchaka, 2015-05-06 17:37
Messages (15)
msg242685 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2015-05-06 16:18
This probably shouldn't be checked in.  But it was an interesting experiment, and I did get it to work.

My brother forwarded me this question from Stack Overflow:
    http://stackoverflow.com/questions/23453133/is-there-a-reason-python-3-enumerates-slower-than-python-2

The debate brought up a good point: a lot of the overhead of range() is in creating and destroying the long objects.  I wondered, could I get rid of that?  Long story short, yes.

rangeiterobject is the special-case range iterator used when you're iterating over integer values that all fit inside C longs.  (Otherwise range() has to use the much slower, general-purpose longrangeiterobject.)  rangeiter_next is simple: it computes the new value, then returns PyLong_FromLong of that value.

First thought: cache a reference to the previous value.  If its reference count is 1, we have the only reference.  Overwrite its value and return it.  But that doesn't help in the general case, because in "for x in range(1000)" x is holding a reference at the time __next__ is called on the iterator.

The trick: cache *two* old yielded objects.  In the general case, by the second iteration, everyone else has dropped their references to the older cached object and we can modify it safely.

The second trick: if the value you're yielding is one of the interned "small ints", you *have* to return the interned small int.  (Otherwise you break 0 == 0, I kid you not.)
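
Roughly, the idea looks like the sketch below.  This is purely illustrative, not the attached patch: the two-slot "cached" field and the exact guard conditions are assumptions, spelled out only to make the trick concrete.

    /* Illustrative sketch only -- not the attached patch.  The two-slot
       "cached" field is assumed; everything else follows the description
       above.  The slot written two iterations ago is usually no longer
       referenced by the loop variable, so its refcount is back to 1 and
       its single digit can be overwritten in place. */
    static PyObject *
    rangeiter_next_sketch(rangeiterobject *r)
    {
        long value;
        PyObject *result;
        PyLongObject *old;
        int slot;

        if (r->index >= r->len)
            return NULL;
        value = r->start + r->index * r->step;
        slot = (int)(r->index & 1);        /* alternate between the two slots */
        r->index++;

        /* Second trick: interned small ints must be returned as the shared
           objects, never as a mutated copy. */
        if (value >= -5 && value <= 256)
            return PyLong_FromLong(value);

        old = r->cached[slot];             /* assumed two-element cache */
        if (old != NULL && Py_REFCNT(old) == 1 && Py_SIZE(old) == 1
                && value > 0 && value < (1L << PyLong_SHIFT)) {
            old->ob_digit[0] = (digit)value;   /* overwrite in place */
            Py_INCREF(old);
            return (PyObject *)old;
        }

        result = PyLong_FromLong(value);
        if (result != NULL) {
            /* Remember this object so it can be recycled two iterations on. */
            Py_XDECREF(old);
            Py_INCREF(result);
            r->cached[slot] = (PyLongObject *)result;
        }
        return result;
    }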

With this patch applied all regression tests pass.  And, on my desktop machine, the benchmark they used in the above link:

./python -mtimeit -n 5 -r 2 -s"cnt = 0" "for i in range(10000000): cnt += 1"

drops from 410ms to 318ms ("5 loops, best of 2: 318 msec per loop").


This implementation requires the rangeiterobject to have intimate knowledge of the implementation of PyLongObject, including copy-and-pasting some information that isn't otherwise exposed (essentially the maximum small int).  At the very least that information would need to be exposed properly, so rangeiterobject could use it correctly, before this could be checked in.

It might be cleaner for longobject.c to expose a private "set this long object to this value" function.  It would fail if we can't do it safely: if the value was an interned (small) int, or if the long was of the wrong size (Py_SIZE(o) != 1).
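
Such a helper might look something like this (a hedged sketch; the name _PyLong_SetValueInPlace is invented here, and longobject.c exposes nothing like it today):

    /* Hypothetical private helper: overwrite a positive 1-digit long in
       place.  Returns 0 on success, -1 if it cannot be done safely. */
    static int
    _PyLong_SetValueInPlace(PyLongObject *op, long value)
    {
        if (Py_REFCNT(op) != 1)
            return -1;                       /* someone else can still see it */
        if (value >= -5 && value <= 256)
            return -1;                       /* must use the interned small int */
        if (Py_SIZE(op) != 1 || value <= 0 || value >= (1L << PyLong_SHIFT))
            return -1;                       /* only positive 1-digit longs */
        op->ob_digit[0] = (digit)value;
        return 0;
    }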


Is this interesting enough to pursue?  I'm happy to drop it.
msg242686 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-06 16:38
This regression was just discussed in issue24076. A general suggestion of a free list for small ints was proposed there. It could speed up other integer computations as well, but comprehensive benchmarking results are needed.

See also the similar issue23507 for tuples. Perhaps we need a general solution for fast specialized free lists.
msg242687 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2015-05-06 16:38
See issue 24076.
msg242691 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-06 17:37
Here is a patch that adds a free list for 1-digit long objects.

$ ./python -m timeit -s "r = range(10**3)" -- "for i in r: pass"
Unpatched: 10000 loops, best of 3: 54.4 usec per loop
With free list: 10000 loops, best of 3: 38 usec per loop
With hacked range: 10000 loops, best of 3: 34.5 usec per loop
Python 2.7: 10000 loops, best of 3: 37.1 usec per loop

$ ./python -m timeit -s "r = list(range(10**3))" -- "for i in r: pass"
10000 loops, best of 3: 30.7 usec per loop

In Python 2.7:
$ ./python -m timeit -s "r = xrange(10**3)" -- "for i in r: pass"
10000 loops, best of 3: 41.4 usec per loop
$ ./python -m timeit -s "r = range(10**3)" -- "for i in r: pass"
10000 loops, best of 3: 37.1 usec per loop
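
The general shape of such a free list is sketched below; this is only an illustration of the approach, and the names and size limit are assumptions rather than what int_free_list.patch actually does.

    /* Sketch of a free list for 1-digit longs (illustration only). */
    #define MAX_FREE_ONEDIGIT 100

    static PyLongObject *onedigit_free_list[MAX_FREE_ONEDIGIT];
    static int onedigit_numfree = 0;

    static PyLongObject *
    onedigit_alloc(void)
    {
        if (onedigit_numfree > 0) {
            PyLongObject *op = onedigit_free_list[--onedigit_numfree];
            _Py_NewReference((PyObject *)op);   /* refcount back to 1 */
            return op;                          /* caller fills ob_digit[0] */
        }
        return _PyLong_New(1);                  /* fall back to a fresh object */
    }

    static void
    onedigit_free(PyLongObject *op)
    {
        /* Called from the long deallocator when Py_SIZE(op) == 1. */
        if (onedigit_numfree < MAX_FREE_ONEDIGIT)
            onedigit_free_list[onedigit_numfree++] = op;
        else
            Py_TYPE(op)->tp_free((PyObject *)op);
    }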
msg242704 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2015-05-06 19:55
How does this apply for people like me, who scarcely if ever use range() in standard for loops?  There are plenty of recipes using itertools that show you don't need range(), so is this really needed?  No axe to grind, just curious.
msg242707 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2015-05-06 20:47
10**3 doesn't show off this hack as much as other numbers would; the hack only operates from 257 up to the maximum value that will fit in a single long "digit" (32767 on 32-bit, 2**30 on 64-bit).
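
Concretely, the in-place reuse can only kick in for values above the small-int cache and still within one digit; something like the following (illustrative only; 256 and PyLong_SHIFT are CPython's small-int bound and digit width):

    /* Rough applicability test for the range() hack (illustrative). */
    static int
    hack_applies(long value)
    {
        return value > 256                     /* above the interned small ints */
            && value < (1L << PyLong_SHIFT);   /* fits in one digit: up to 32767
                                                  on 32-bit, about 2**30 on 64-bit */
    }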

Anyway, a freelist for one-digit longs seems nearly as good, and it's a lot more general-purpose, so I'm much more interested in that.
msg242885 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2015-05-11 08:57
I like the idea of adding a free list of longs in Python 3, but I think we should extend this somewhat to cover more ground, e.g. by pre-allocating a block of 1-digit long objects, as we did for Python 2 ints, and perhaps allocating up to 4k (= 1 memory page) towards such a free list.

The added cache locality of having the data in a pre-allocated block should provide some more performance.
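
A rough sketch of that shape, loosely modelled on Python 2's int block allocator (names, sizes and the init/dealloc hooks here are all assumptions, not a patch):

    /* Illustrative only: a single pre-allocated ~4k block of 1-digit longs.
       The point is cache locality -- the objects live in one contiguous page. */
    #define PREALLOC_BYTES 4096                   /* about one memory page */
    #define N_PREALLOC (PREALLOC_BYTES / sizeof(PyLongObject))

    static PyLongObject prealloc_block[N_PREALLOC];
    static PyLongObject *prealloc_slots[N_PREALLOC];
    static Py_ssize_t prealloc_nfree = 0;

    static void
    prealloc_init(void)
    {
        for (Py_ssize_t i = 0; i < (Py_ssize_t)N_PREALLOC; i++)
            prealloc_slots[i] = &prealloc_block[i];
        prealloc_nfree = N_PREALLOC;
    }

    /* Allocate a positive 1-digit long with the given value. */
    static PyLongObject *
    prealloc_long(digit value)
    {
        PyLongObject *op;
        if (prealloc_nfree > 0) {
            op = prealloc_slots[--prealloc_nfree];
            (void)PyObject_INIT_VAR(op, &PyLong_Type, 1);  /* type, refcount, size */
        }
        else {
            op = _PyLong_New(1);           /* block exhausted: normal heap path */
            if (op == NULL)
                return NULL;
        }
        op->ob_digit[0] = value;
        return op;
    }

    /* The long deallocator would have to push block-resident objects back
       onto prealloc_slots instead of calling tp_free; omitted for brevity. */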
msg242886 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-11 09:42
According to my and Larry's measurements [1], the distribution of created ints by size (number of digits) while running the Python test suite is as follows (columns: count, % of ints, cumulative %):

On 32-bit Linux:
int                               42828741  13.40%
                              0     425353   0.99%   0.99%
                              1   21399290  49.96%  50.96%
                              2   10496856  24.51%  75.47%
                              3    4873346  11.38%  86.85%
                              4    1021563   2.39%  89.23%
                              5    1246444   2.91%  92.14%
                              6     733676   1.71%  93.85%
                              7     123074   0.29%  94.14%
                              8     139203   0.33%  94.47%
...

On 64-bit Linux:
int                              47600237  13.53%
                             0     294964   0.62%   0.62%
                             1   36135772  75.92%  76.53%
                             2    4504046   9.46%  86.00%
                             3    2109837   4.43%  90.43%
                             4    1277995   2.68%  93.11%
                             5     542775   1.14%  94.25%
                             6     485451   1.02%  95.27%
...

86% of ints have size <= 3 digits on 32-bit and <= 2 digits on 64-bit. That is enough to represent 32-bit integers (as Python 2's int did). I think we should support a free list not only for 1-digit ints, but for up to 3 digits on 32-bit builds and up to 2 digits on 64-bit builds. Another natural limit is 3 digits on a 64-bit build (enough to represent a 64-bit C long on Linux, or a pointer on all platforms). Larger integers are perhaps encountered mainly in tests.

Pre-allocating a block has a disadvantage: it is hard to free the allocated block. A program can create a lot of integers, then drop most of them and need the memory for other purposes, but blocks once allocated for integers would never be freed. This is a non-trivial design decision and should be discussed on Python-Dev and accepted by the BDFL.
msg242887 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-11 09:43
[1] http://comments.gmane.org/gmane.comp.python.devel/153078
msg242888 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-05-11 10:09
I'd rather stick to the simple freelist approach. People interested in (allegedly) more ambitious designs can open new issues, together with patches to evaluate their efficiency.
msg242889 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2015-05-11 10:10
I'd rather have the general-purpose freelist for ints too.  How about we close this issue now?  Assuming the freelist for ints goes in, we can abandon this approach entirely.
msg242891 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2015-05-11 11:05
On 11.05.2015 11:42, Serhiy Storchaka wrote:
> 
> Pre-allocating a block has a disadvantage: it is hard to free the allocated block. A program can create a lot of integers, then drop most of them and need the memory for other purposes, but blocks once allocated for integers would never be freed. This is a non-trivial design decision and should be discussed on Python-Dev and accepted by the BDFL.

True, but if it's only 1-4k RAM, I don't think anyone would mind :-)

Python 2 is doing exactly that, with 1k of RAM for the integer free list. I think it was one of the first free lists ever added to Python, so it was done at a time when RAM was expensive ;-).
msg242892 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-11 12:36
Well, 1-4k of RAM can be consumed just after start-up (it's only 100-300 integers) and never freed. What next?
msg402137 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-09-18 17:44
Should we try the free list for integers again?
msg402138 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-09-18 17:55
(Never mind, that should have gone to https://bugs.python.org/issue24165, which is the integer freelist -- conclusion there in 2015 was also that it didn't do much.)
History
Date                 User              Action  Args
2022-04-11 14:58:16  admin             set     github: 68326
2021-09-18 17:55:25  gvanrossum        set     messages: + msg402138
2021-09-18 17:44:32  gvanrossum        set     nosy: + gvanrossum; messages: + msg402137
2015-05-11 12:36:52  serhiy.storchaka  set     messages: + msg242892
2015-05-11 11:05:13  lemburg           set     messages: + msg242891
2015-05-11 10:10:56  larry             set     status: open -> closed; resolution: rejected; messages: + msg242889; stage: patch review -> resolved
2015-05-11 10:09:17  pitrou            set     messages: + msg242888
2015-05-11 09:43:06  serhiy.storchaka  set     messages: + msg242887
2015-05-11 09:42:21  serhiy.storchaka  set     messages: + msg242886
2015-05-11 08:57:43  lemburg           set     nosy: + lemburg; messages: + msg242885
2015-05-07 20:48:43  ethan.furman      set     nosy: + ethan.furman
2015-05-06 20:47:57  larry             set     messages: + msg242707
2015-05-06 19:55:59  BreamoreBoy       set     nosy: + BreamoreBoy; messages: + msg242704
2015-05-06 17:37:42  serhiy.storchaka  set     files: + int_free_list.patch; keywords: + patch
2015-05-06 17:37:23  serhiy.storchaka  set     messages: + msg242691
2015-05-06 16:38:52  scoder            set     nosy: + scoder; messages: + msg242687
2015-05-06 16:38:08  serhiy.storchaka  set     messages: + msg242686
2015-05-06 16:28:20  mark.dickinson    set     nosy: + mark.dickinson
2015-05-06 16:18:56  larry             create