Message 94836 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	milko.krachounov
Recipients	mark.dickinson, milko.krachounov, rhettinger, tebeka, terry.reedy
Date	2009-11-02.17:42:47
SpamBayes Score	5.551115e-17
Marked as misclassified	No
Message-id	<1257183773.5.0.98206846201.issue4356@psf.upfronthosting.co.za>
In-reply-to

Content
I've been bugged by the lack of key= argument for bisect for some time now, and today I got to read this and the previous issues about the matter. I still fail to understand the reasons for the rejections. It might encourage bad design in which expensive key functions will get called more than once for the same element. However: 1. Not all key= functions are computationally expensive, in fact a quick search shows that most key= uses come are with itemgetter, attrgetter or some cousin. 2. Even for a computationally expensive function precached key isn't necessarily the right answer. And if the keys are huge, a key= function can actually be useful in implementing intelligent caching. 3. The reason for rejection is merely a performance issue which can be fixed without any significant changes to the design of the app should it prove to be a real issue. I was writing a module that keeps a sorted list of items in which items can be added, removed, or changed (they might get reordered). I used bisect to avoid useless slow linear searches, which would be one obstacle for the app to scale well. However, the sort key can be changed at any time, so implementing __lt__ wasn't an option. Keeping a second list with the keys is both memory and computationally expensive as inserting into a list is slow. It's not a shiny example of a bisect use-case as the only thing I've used my module so far is to display the files in a directory, which means the lists are small and the changes are seldom, so any implementation is good enough. But bisect with key= would have been best if it was available. What's worse, I have a buggy unreadable ad hoc implementation of bisect in my code right now. ;) I see the following as reasons why key= would provide benefit: 1. If you have a sorted list, bisect is a net win. Having a key= would enable you to utilize it without refactoring anything. The lack of key may as well encourage you to continue using linear searches, or other sub-optimal solutions. 2. Using key= is more readable and less error-prone than keeping two lists in simple cases where defining a class with __le__ is an overkill. Two examples I had where this was the case: a) A class I implemented to pick numbers from a certain discrete random distribution with bisect. Oh, but yeah, the implementation without key= is twice faster. b) I gave a hand to someone who was implementing a dictionary looked up the words as you typed them with bisect. 3. Insort. Insort is already slow enough as it is, a key= wouldn't slow it down much if at all. In fact, if you are keeping two lists, key= would speed it up. Insort with key= is a net win. I've implemented key= and quickly hacked a benchmark to show what performance hit it might have, and to put things into perspective. The results: 1. Cheap key function, integers and abs(), 1000000 items, 5000 bisects or insorts: a) Bisect with a key: 0.02205014s b) Bisect with a second list: 0.01328707s c) Insort with a key: 5.83688211s d) Bisect with a second list, and two inserts: 16.17302299s 2. Cheap key function, ~4000 char bytestrings and len(), 100000 items, 5000 bisects or insorts: a) Bisect with a key: 0.01829195s b) Bisect with a second list: 0.00945401s c) Insort with a key: 0.25511408s d) Bisect with a second list, and two inserts: 0.49303603s 3. Expensive key function, ~4000 char bytestrings and str.lower(), 100000 (500 MB) items, 5000 bisects or insorts: a) Bisect with a key: 1.26837015s b) Bisect with a second list: 0.08390594s c) Insort with a key: 1.50406289s d) Bisect with a second list, and two inserts: 0.57737398s 4. Expensive key function, ~4000 char bytestrings and str.lower(), 500000 (2.5 GB) items, 5000 bisects or insorts: a) Bisect with a key: 1.46136308s b) Bisect with a second list: 0.08768606s c) Insort with a key: 3.05218720s d) Bisect with a second list, and two inserts: 6.43227386s 5. Expensive key function, ~4000 char bytestrings and str.strip(), 100000 (500 MB) items, 5000 bisects or insorts: a) Bisect with a key: 0.03311396s b) Bisect with a second list: 0.01374602s c) Insort with a key: 0.27423000s d) Bisect with a second list, and two inserts: 0.49585080s 6. Expensive key function, ~4000 char bytestrings and str.strip(), 500000 (2.5 GB) items, 5000 bisects or insorts: a) Bisect with a key: 0.04530501 b) Bisect with a second list: 0.01912594 c) Insort with a key: 1.62209797 d) Bisect with a second list, and two inserts: 5.91734695 Also, I tried to bench linear searches, but as they had to run in Python code they aren't representative of anything. In the integer test they went for about 250 seconds without recalculating the key, and for about 500 with. Also, replacing them with .index() gave about 60 seconds if I ensured there's high probability that the number is in the list, and for 500 if I didn't. In short, key= for bisect would be convenient and neat, really useful in rare cases, leading to more readable code in the most common cases, and I'm unconvinced of the perceived harm that it would cause. I attach a patch implementing it for trunk, py3k, and the benchmark script I used to test it (on trunk).

I've been bugged by the lack of key= argument for bisect for some time
now, and today I got to read this and the previous issues about the
matter. I still fail to understand the reasons for the rejections. It
might encourage bad design in which expensive key functions will get
called more than once for the same element. However:

1. Not all key= functions are computationally expensive, in fact a quick
search shows that most key= uses come are with itemgetter, attrgetter or
some cousin.
2. Even for a computationally expensive function precached key isn't
necessarily the right answer. And if the keys are huge, a key= function
can actually be useful in implementing intelligent caching.
3. The reason for rejection is merely a performance issue which *can* be
fixed without any significant changes to the design of the app should it
prove to be a real issue.


I was writing a module that keeps a sorted list of items in which items
can be added, removed, or changed (they might get reordered). I used
bisect to avoid useless slow linear searches, which would be one
obstacle for the app to scale well. However, the sort key can be changed
at any time, so implementing __lt__ wasn't an option. Keeping a second
list with the keys is both memory and computationally expensive as
inserting into a list is slow. It's not a shiny example of a bisect
use-case as the only thing I've used my module so far is to display the
files in a directory, which means the lists are small and the changes
are seldom, so any implementation is good enough.  But bisect with key=
would have been best if it was available. What's worse, I have a buggy
unreadable ad hoc implementation of bisect in my code right now. ;)

I see the following as reasons why key= would provide benefit:
1. If you have a sorted list, bisect is a net win. Having a key= would
enable you to utilize it without refactoring anything. The lack of key
may as well encourage you to continue using linear searches, or other
sub-optimal solutions.
2. Using key= is more readable and less error-prone than keeping two
lists in simple cases where defining a class with __le__ is an overkill.
Two examples I had where this was the case:
a) A class I implemented to pick numbers from a certain discrete random
distribution with bisect. Oh, but yeah, the implementation without key=
is twice faster.
b) I gave a hand to someone who was implementing a dictionary looked up
the words as you typed them with bisect.
3. Insort. Insort is already slow enough as it is, a key= wouldn't slow
it down much if at all. In fact, if you are keeping two lists, key=
would speed it up. Insort with key= is a net win.

I've implemented key= and quickly hacked a benchmark to show what
performance hit it might have, and to put things into perspective. The
results:

1. Cheap key function, integers and abs(), 1000000 items, 5000 bisects
or insorts:
a) Bisect with a key: 0.02205014s
b) Bisect with a second list: 0.01328707s
c) Insort with a key: 5.83688211s
d) Bisect with a second list, and two inserts: 16.17302299s

2. Cheap key function, ~4000 char bytestrings and len(), 100000 items,
5000 bisects or insorts:
a) Bisect with a key: 0.01829195s
b) Bisect with a second list: 0.00945401s
c) Insort with a key: 0.25511408s
d) Bisect with a second list, and two inserts: 0.49303603s

3. Expensive key function, ~4000 char bytestrings and str.lower(),
100000 (500 MB) items, 5000 bisects or insorts:
a) Bisect with a key: 1.26837015s
b) Bisect with a second list: 0.08390594s
c) Insort with a key: 1.50406289s
d) Bisect with a second list, and two inserts: 0.57737398s

4. Expensive key function, ~4000 char bytestrings and str.lower(),
500000 (2.5 GB) items, 5000 bisects or insorts:
a) Bisect with a key: 1.46136308s
b) Bisect with a second list: 0.08768606s
c) Insort with a key: 3.05218720s
d) Bisect with a second list, and two inserts: 6.43227386s

5. Expensive key function, ~4000 char bytestrings and str.strip(),
100000 (500 MB) items, 5000 bisects or insorts:
a) Bisect with a key: 0.03311396s
b) Bisect with a second list: 0.01374602s
c) Insort with a key: 0.27423000s
d) Bisect with a second list, and two inserts: 0.49585080s

6. Expensive key function, ~4000 char bytestrings and str.strip(),
500000 (2.5 GB) items, 5000 bisects or insorts:
a) Bisect with a key: 0.04530501
b) Bisect with a second list: 0.01912594
c) Insort with a key: 1.62209797
d) Bisect with a second list, and two inserts: 5.91734695

Also, I tried to bench linear searches, but as they had to run in Python
code they aren't representative of anything. In the integer test they
went for about 250 seconds without recalculating the key, and for about
500 with. Also, replacing them with .index() gave about 60 seconds if I
ensured there's high probability that the number is in the list, and for
500 if I didn't.

In short, key= for bisect would be convenient and neat, really useful in
rare cases, leading to more readable code in the most common cases, and
I'm unconvinced of the perceived harm that it would cause.

I attach a patch implementing it for trunk, py3k, and the benchmark
script I used to test it (on trunk).

History
Date	User	Action	Args
2009-11-02 17:42:53	milko.krachounov	set	recipients: + milko.krachounov, rhettinger, terry.reedy, tebeka, mark.dickinson
2009-11-02 17:42:53	milko.krachounov	set	messageid: <1257183773.5.0.98206846201.issue4356@psf.upfronthosting.co.za>
2009-11-02 17:42:51	milko.krachounov	link	issue4356 messages
2009-11-02 17:42:49	milko.krachounov	create