classification
Title: Add a key parameter (like sorted) to heapq.merge
Type: enhancement
Stage: patch review
Components: Library (Lib)
Versions: Python 3.4
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: rhettinger
Nosy List: eric.araujo, giampaolo.rodola, mark.dickinson, rhettinger, serhiy.storchaka, ssapin, stutzbach, terry.reedy
Priority: normal
Keywords: needs review, patch

Created on 2012-01-09 10:35 by ssapin, last changed 2013-04-29 18:55 by serhiy.storchaka.

Files
File name Uploaded Description
heapq_merge_key.patch ssapin, 2012-01-09 10:35 'hg diff' output against rev ca2a35140e6a
benchmark_heapq_merge.py ssapin, 2012-01-09 10:43 benchmark several implementations
heapq_merge_key_duplicate.patch ssapin, 2012-01-16 14:58 'hg diff' output against rev ca2a35140e6a
heap.diff rhettinger, 2013-03-26 06:54 Rough draft (untested) for a Heap() class
heap2.diff rhettinger, 2013-04-29 11:43 Update the draft Heap() class
Messages (16)
msg150927 - (view) Author: Simon Sapin (ssapin) Date: 2012-01-09 10:35
Hi,

The attached patch adds a 'key' optional parameter to the heapq.merge function that behaves as in sorted().
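
To illustrate the intended behaviour, here is a minimal sketch that gets a sorted()-style key on top of the existing heapq.merge by decorating each item. The name merge_with_key is hypothetical and this is not the code from the attached patch; it assumes each input is already sorted according to the key.

```python
import heapq


def merge_with_key(*iterables, key=None):
    """Merge sorted iterables, comparing items by key(item).

    Hypothetical helper for illustration only; not the attached patch.
    """
    if key is None:
        yield from heapq.merge(*iterables)
        return

    def decorate(index, iterable):
        # The iterable's position breaks ties between equal keys, so the
        # items themselves are never compared and the merge stays stable.
        for value in iterable:
            yield key(value), index, value

    decorated = (decorate(i, it) for i, it in enumerate(iterables))
    for _, _, value in heapq.merge(*decorated):
        yield value


words = list(merge_with_key(["apple", "Cherry"], ["Banana", "date"],
                            key=str.lower))
print(words)
```

The key function is called once per item, as with sorted(), at the cost of building one tuple per item.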

Related discussion: http://mail.python.org/pipermail/python-ideas/2012-January/013295.html

This is my first contribution to CPython.
msg150928 - (view) Author: Simon Sapin (ssapin) Date: 2012-01-09 10:43
The attached script benchmarks the baseline (current implementation) against 3 new implementations, as suggested on http://mail.python.org/pipermail/python-ideas/2012-January/013296.html

On my machine, the output is:

    merge_baseline
    per run, min of 3 = 7.527 ms
    
    merge_1
    per run, min of 3 = 9.894 ms
    131.449 % of baseline
    
    merge_2
    per run, min of 3 = 7.948 ms
    105.594 % of baseline
    
    merge_3
    per run, min of 3 = 7.581 ms
    100.716 % of baseline

On this particular input, merge_2 adds about 6% overhead when the key parameter is not used. merge_3 adds only about 1%, but it almost doubles the amount of code (which was admittedly not that long to begin with).

The patch in the previous message is with the merge_2 implementation, which seemed like the best compromise to me.
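
For readers without the attachment, the "min of 3" and "% of baseline" figures could be produced along these lines. This is only a sketch, not the contents of benchmark_heapq_merge.py: the helper names and the input data are assumptions.

```python
import heapq
import timeit


def best_ms(func, repeat=3, number=20):
    """Best per-run time in milliseconds over `repeat` timing rounds."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number * 1000


def percent_of_baseline(candidate_ms, baseline_ms):
    """Express a candidate timing as a percentage of the baseline timing."""
    return candidate_ms / baseline_ms * 100


# Hypothetical input: three interleaved sorted lists of integers.
inputs = [list(range(start, 3000, 3)) for start in range(3)]


def run_baseline():
    return list(heapq.merge(*inputs))


baseline_ms = best_ms(run_baseline)
print("merge_baseline")
print("per run, min of 3 = %.3f ms" % baseline_ms)
```

Taking the minimum over several rounds reduces noise from other processes; the percentages above are each candidate's minimum divided by the baseline's minimum.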
msg150931 - (view) Author: Simon Sapin (ssapin) Date: 2012-01-09 11:10
Oops, the patch to the documentation would also need 'New in 3.3: the key parameter', with the right Sphinx directive. But that depends on whether this change ends up in 3.3 or 3.4.

Does 3.3 still get new features?
msg150954 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-01-09 16:53
Yes, 3.3 is still in the early development stage, and new features will be accepted until the first beta (in June, see PEP 398).  “.. versionadded:: 3.3 The *key* parameter” will do.
msg150969 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2012-01-09 19:33
Simon, please keep the original version fast by creating two code paths:

    if key is None:
        original_code
    else:
        new_code using the key_function
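
A rough sketch of what two code paths could look like follows. It is written from scratch for illustration and is neither merge_3 nor the stdlib implementation; the name merge_two_paths and the sentinel _DONE are assumptions.

```python
import heapq

_DONE = object()  # sentinel marking an exhausted iterator


def merge_two_paths(*iterables, key=None):
    """Merge sorted iterables; illustrative sketch with two code paths."""
    iterators = [iter(it) for it in iterables]
    heap = []
    if key is None:
        # Unkeyed path: heap entries are (value, order), no key calls.
        for order, iterator in enumerate(iterators):
            value = next(iterator, _DONE)
            if value is not _DONE:
                heap.append((value, order))
        heapq.heapify(heap)
        while heap:
            value, order = heap[0]
            yield value
            value = next(iterators[order], _DONE)
            if value is _DONE:
                heapq.heappop(heap)
            else:
                heapq.heapreplace(heap, (value, order))
    else:
        # Keyed path: heap entries are (key(value), order, value).
        for order, iterator in enumerate(iterators):
            value = next(iterator, _DONE)
            if value is not _DONE:
                heap.append((key(value), order, value))
        heapq.heapify(heap)
        while heap:
            _, order, value = heap[0]
            yield value
            value = next(iterators[order], _DONE)
            if value is _DONE:
                heapq.heappop(heap)
            else:
                heapq.heapreplace(heap, (key(value), order, value))
```

The two branches are nearly identical, which is the duplication being weighed here against the cost of an extra tuple slot and key call per item.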
msg150983 - (view) Author: Simon Sapin (ssapin) Date: 2012-01-09 22:13
Raymond, please have a look at merge_3 in benchmark_heapq_merge.py. It is implemented as you say.

Do you think the speed is worth the code duplication?
msg151369 - (view) Author: Simon Sapin (ssapin) Date: 2012-01-16 14:58
heapq_merge_key_duplicate.patch is a new patch with two code paths. It also updates the function’s docstring (which the previous patch did not). Raymond, do you think the speed is worth the DRY violation?
msg152802 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2012-02-07 04:11
I'll look at this in the next couple of weeks.  Hang tight :-)
msg152984 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-02-09 20:39
FWIW, Guido approves of the idea; see msg152969 in #4356.
msg168070 - (view) Author: Simon Sapin (ssapin) Date: 2012-08-13 07:51
I just remembered about this. I suppose it is too late for 3.3?
msg168116 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-08-13 15:20
Yes, 3.3 is already in beta.
msg185259 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-03-26 06:54
Attaching a rough draft implementation for a fully encapsulated Heap() class that is thread-safe, supports minheaps and maxheaps, and efficiently implements key-functions (called no more than once per key).
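
As a reading aid, the properties listed above could be combined roughly as follows. This sketch is not the attached heap.diff: the method names, the _Reversed wrapper and the tie-breaking counter are all assumptions made for illustration.

```python
import heapq
import itertools
import threading


class _Reversed:
    """Wrap a sort key so that larger keys compare as smaller."""
    __slots__ = ("key",)

    def __init__(self, key):
        self.key = key

    def __lt__(self, other):
        return other.key < self.key

    def __eq__(self, other):
        return self.key == other.key


class Heap:
    """Illustrative encapsulated heap; not the draft attached to this issue."""

    def __init__(self, iterable=(), key=None, maxheap=False):
        self._key = key
        self._maxheap = maxheap
        self._lock = threading.Lock()
        self._count = itertools.count()  # tie-breaker, keeps order stable
        self._data = [(self._sortkey(item), next(self._count), item)
                      for item in iterable]
        heapq.heapify(self._data)

    def _sortkey(self, item):
        # The key function runs exactly once per item, here.
        sortkey = item if self._key is None else self._key(item)
        return _Reversed(sortkey) if self._maxheap else sortkey

    def push(self, item):
        sortkey = self._sortkey(item)  # user code runs outside the lock
        with self._lock:
            heapq.heappush(self._data, (sortkey, next(self._count), item))

    def pop(self):
        with self._lock:
            return heapq.heappop(self._data)[2]

    def peek(self):
        with self._lock:
            return self._data[0][2]

    def __len__(self):
        with self._lock:
            return len(self._data)
```

Storing (sortkey, counter, item) means the items themselves are never compared, and wrapping the key rather than negating it lets a maxheap work with non-numeric keys.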
msg188066 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2013-04-29 12:14
heap2.diff contains only a single line's change.  Wrong file attached?
msg188067 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2013-04-29 12:14
Ah, I see the new file now (I'd failed to refresh my browser);  sorry for the noise.
msg188080 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2013-04-29 17:51
Looks pretty good to me.

- There's a bonus print call in the diff.

- Should the "len(self._data)" call be protected by the lock?  I can't immediately think of any reason why that would be necessary (e.g., pushpop and poppush never change the size of self._data, so there's no risk of getting a bogus length there), but the lack of the lock makes me nervous.

- Support for iter() seems a bit out of place to me.  What are the use-cases for this?  Would it make sense to leave this out (for now)?
msg188085 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-04-29 18:55
There is already one heap class in the stdlib: queue.PriorityQueue. Why create a duplicate instead of extending queue.PriorityQueue with the desired features?

Maybe name the maxheap parameter reverse?
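
For comparison, a key function can already be emulated with queue.PriorityQueue by queueing decorated tuples. The sketch below only shows that existing workaround; the sample data and the sequence counter are assumptions, and it says nothing about how an extended PriorityQueue would look.

```python
import itertools
import queue

pq = queue.PriorityQueue()
seq = itertools.count()  # tie-breaker so the words are never compared

for word in ["banana", "fig", "pear"]:
    # Entry layout: (key(item), insertion order, item), with key = len.
    pq.put((len(word), next(seq), word))

ordered = [pq.get()[2] for _ in range(3)]
print(ordered)
```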
History
Date User Action Args
2013-04-29 18:55:40 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg188085
2013-04-29 17:51:35 mark.dickinson set messages: + msg188080
2013-04-29 12:14:41 mark.dickinson set messages: + msg188067
2013-04-29 12:14:03 mark.dickinson set nosy: + mark.dickinson; messages: + msg188066
2013-04-29 11:43:52 rhettinger set files: - heap2.diff
2013-04-29 11:43:35 rhettinger set files: + heap2.diff
2013-04-29 10:54:52 rhettinger set priority: low -> normal; files: + heap2.diff
2013-03-26 06:54:31 rhettinger set files: + heap.diff; messages: + msg185259
2012-08-13 15:20:10 eric.araujo set keywords: + needs review; stage: patch review; messages: + msg168116; versions: + Python 3.4, - Python 3.3
2012-08-13 07:51:59 ssapin set messages: + msg168070
2012-02-09 20:39:38 terry.reedy set nosy: + terry.reedy; messages: + msg152984
2012-02-07 08:44:03 giampaolo.rodola set nosy: + giampaolo.rodola
2012-02-07 04:11:24 rhettinger set messages: + msg152802
2012-01-16 14:58:54 ssapin set files: + heapq_merge_key_duplicate.patch; messages: + msg151369
2012-01-09 22:13:45 ssapin set messages: + msg150983
2012-01-09 19:33:36 rhettinger set priority: normal -> low; messages: + msg150969
2012-01-09 18:44:52 stutzbach set nosy: + stutzbach
2012-01-09 16:53:30 eric.araujo set nosy: + eric.araujo; messages: + msg150954; versions: - Python 3.4
2012-01-09 11:10:14 ssapin set messages: + msg150931
2012-01-09 10:51:00 eric.smith set assignee: rhettinger
2012-01-09 10:43:36 ssapin set files: + benchmark_heapq_merge.py; messages: + msg150928
2012-01-09 10:35:41 ssapin create