
Author milko.krachounov
Recipients mark.dickinson, milko.krachounov, rhettinger, tebeka, terry.reedy
Date 2009-11-02.17:42:47
Message-id <1257183773.5.0.98206846201.issue4356@psf.upfronthosting.co.za>
In-reply-to
Content
I've been bugged by the lack of a key= argument for bisect for some time
now, and today I got to read this and the previous issues about the
matter. I still fail to understand the reasons for the rejections. The
stated concern is that it might encourage bad design in which expensive
key functions get called more than once for the same element. However:

1. Not all key= functions are computationally expensive; in fact, a quick
search shows that most key= uses are with itemgetter, attrgetter, or
some cousin.
2. Even for a computationally expensive function, a precached key isn't
necessarily the right answer. And if the keys are huge, a key= function
can actually be useful in implementing intelligent caching.
3. The reason for rejection is merely a performance issue, which *can* be
fixed without any significant changes to the design of the app, should
it prove to be a real one.
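For concreteness, here is a minimal pure-Python sketch of the semantics
I have in mind (not the attached patch itself): key= is applied to the
probed list elements, while x is passed in already as a key value.

```python
def bisect_left_key(a, x, lo=0, hi=None, key=None):
    """Locate the leftmost insertion point for x in sorted list a.

    key, if given, is applied to each probed element of a; x is
    assumed to already be a key value.
    """
    if key is None:
        key = lambda v: v
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if key(a[mid]) < x:
            lo = mid + 1
        else:
            hi = mid
    return lo
```

With this, bisecting a list of (name, size) tuples by size is a one-liner
with key=itemgetter(1), and no second list of keys is needed.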


I was writing a module that keeps a sorted list of items, to which items
can be added, removed, or changed (they might get reordered). I used
bisect to avoid uselessly slow linear searches, which would be one
obstacle to the app scaling well. However, the sort key can be changed
at any time, so implementing __lt__ wasn't an option. Keeping a second
list with the keys is both memory- and computationally expensive, as
inserting into a list is slow. It's not a shiny example of a bisect
use case, as the only thing I've used my module for so far is displaying
the files in a directory, which means the lists are small and changes
are infrequent, so any implementation is good enough. But bisect with
key= would have been best if it were available. What's worse, I have a
buggy, unreadable ad hoc implementation of bisect in my code right now. ;)
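To illustrate the workaround I'm complaining about (a hypothetical toy,
not my actual module): a sorted container built on the parallel-keys
pattern has to keep two lists in lockstep and pay for two O(n) inserts.

```python
import bisect

class SortedByKey:
    """Toy sorted container using the parallel-keys workaround.

    Every mutation must keep _keys and _items in lockstep, which is
    exactly the error-prone duplication a key= argument to
    bisect/insort would remove.
    """
    def __init__(self, key):
        self._key = key
        self._keys = []   # precomputed keys, kept sorted
        self._items = []  # items, in the same order as _keys

    def add(self, item):
        k = self._key(item)
        i = bisect.bisect_right(self._keys, k)
        self._keys.insert(i, k)       # first O(n) insert
        self._items.insert(i, item)   # second O(n) insert

    def index(self, item):
        k = self._key(item)
        i = bisect.bisect_left(self._keys, k)
        # Scan the run of equal keys for the exact item.
        while i < len(self._items) and self._keys[i] == k:
            if self._items[i] == item:
                return i
            i += 1
        raise ValueError("item not found")
```

With key= on bisect/insort, _keys and both of its insert calls disappear.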

I see the following as reasons why key= would provide benefit:
1. If you have a sorted list, bisect is a net win. Having key= would
let you use it without refactoring anything. The lack of key= may well
encourage you to keep using linear searches or other sub-optimal
solutions.
2. Using key= is more readable and less error-prone than keeping two
lists in simple cases where defining a class with __lt__ is overkill.
Two examples I had where this was the case:
a) A class I implemented to pick numbers from a certain discrete random
distribution with bisect. Oh, but yeah, the implementation without key=
is twice as fast.
b) I gave a hand to someone who was implementing a dictionary that
looked up words as you typed them, with bisect.
3. Insort. Insort is already slow enough as it is; a key= wouldn't slow
it down much, if at all. In fact, if you are keeping two lists, key=
would speed it up. Insort with key= is a net win.
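To make example 2a concrete, here is a hedged sketch (not the class I
actually wrote) of the classic discrete-distribution trick: bisect into
a running total of weights. Note the second list of cumulative sums it
has to maintain; with key=, one could bisect a single list of
(item, cumulative) records directly.

```python
import bisect
import random

def make_picker(weighted_items):
    """Return a function picking items with the given relative weights.

    weighted_items: list of (item, weight) pairs.
    """
    items = []
    cumulative = []  # running totals of the weights, sorted by construction
    total = 0
    for item, weight in weighted_items:
        total += weight
        items.append(item)
        cumulative.append(total)

    def pick(rng=random):
        # Draw a point in [0, total) and find the bucket it lands in.
        r = rng.random() * total
        return items[bisect.bisect_right(cumulative, r)]

    return pick
```

Usage: make_picker([("a", 1), ("c", 3)])() returns "c" about three times
as often as "a".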

I've implemented key= and quickly hacked up a benchmark to show what
performance hit it might have, and to put things into perspective. The
results:

1. Cheap key function, integers and abs(), 1000000 items, 5000 bisects
or insorts:
a) Bisect with a key: 0.02205014s
b) Bisect with a second list: 0.01328707s
c) Insort with a key: 5.83688211s
d) Bisect with a second list, and two inserts: 16.17302299s

2. Cheap key function, ~4000 char bytestrings and len(), 100000 items,
5000 bisects or insorts:
a) Bisect with a key: 0.01829195s
b) Bisect with a second list: 0.00945401s
c) Insort with a key: 0.25511408s
d) Bisect with a second list, and two inserts: 0.49303603s

3. Expensive key function, ~4000 char bytestrings and str.lower(),
100000 (500 MB) items, 5000 bisects or insorts:
a) Bisect with a key: 1.26837015s
b) Bisect with a second list: 0.08390594s
c) Insort with a key: 1.50406289s
d) Bisect with a second list, and two inserts: 0.57737398s

4. Expensive key function, ~4000 char bytestrings and str.lower(),
500000 (2.5 GB) items, 5000 bisects or insorts:
a) Bisect with a key: 1.46136308s
b) Bisect with a second list: 0.08768606s
c) Insort with a key: 3.05218720s
d) Bisect with a second list, and two inserts: 6.43227386s

5. Expensive key function, ~4000 char bytestrings and str.strip(),
100000 (500 MB) items, 5000 bisects or insorts:
a) Bisect with a key: 0.03311396s
b) Bisect with a second list: 0.01374602s
c) Insort with a key: 0.27423000s
d) Bisect with a second list, and two inserts: 0.49585080s

6. Expensive key function, ~4000 char bytestrings and str.strip(),
500000 (2.5 GB) items, 5000 bisects or insorts:
a) Bisect with a key: 0.04530501s
b) Bisect with a second list: 0.01912594s
c) Insort with a key: 1.62209797s
d) Bisect with a second list, and two inserts: 5.91734695s

Also, I tried to benchmark linear searches, but as they had to run in
Python code they aren't representative of anything. In the integer test
they ran for about 250 seconds without recalculating the key, and about
500 seconds with it. Replacing them with .index() gave about 60 seconds
if I ensured a high probability that the number was in the list, and
about 500 if I didn't.
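The attached benchmark script isn't reproduced here, but a rough
timeit-style harness for the bisect comparison (hypothetical names,
deliberately small sizes) would look something like this, with an
inlined key-aware loop standing in for what the proposed key= would do
in C:

```python
import bisect
import timeit

def bench():
    # Items sorted by abs(), plus the "second list" of precomputed keys.
    items = sorted(range(-1000, 1000), key=abs)
    keys = [abs(x) for x in items]

    def with_second_list():
        for k in range(0, 1000, 7):
            bisect.bisect_left(keys, k)

    def with_key_calls():
        # Key-aware bisect inlined in Python: calls abs() per probe.
        for k in range(0, 1000, 7):
            lo, hi = 0, len(items)
            while lo < hi:
                mid = (lo + hi) // 2
                if abs(items[mid]) < k:
                    lo = mid + 1
                else:
                    hi = mid

    return (timeit.timeit(with_second_list, number=100),
            timeit.timeit(with_key_calls, number=100))
```

The absolute numbers are machine-dependent, of course; the point is only
the shape of the comparison.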

In short, key= for bisect would be convenient and neat: really useful
in rare cases, and leading to more readable code in the most common
ones. I remain unconvinced of the perceived harm it would cause.

I'm attaching a patch implementing it for trunk and py3k, and the
benchmark script I used to test it (on trunk).