Message 334796 - Python tracker


Author	rhettinger
Recipients	rhettinger, steven.daprano
Date	2019-02-03.18:51:25
Message-id	<1549219885.57.0.86866159357.issue35892@roundup.psfhosted.org>
In-reply-to

Content
The current code for mode() does a good deal of extra work to support its two error outcomes (empty input and multimodal input).  The latter case is informative, but it provides no reasonable way to find just one of those modes when any of the most popular values would suffice.  This arises in nearest-neighbor algorithms, for example.  I suggest adding an option to the API:

    from collections import Counter

    def mode(seq, *, first_tie=False):
        if first_tie:
            # CHOOSE FIRST x ∈ S | ∄ y ∈ S : x ≠ y ∧ count(y) > count(x)
            return Counter(seq).most_common(1)[0][0]
        ...

Use it like this:

    >>> data = 'ABBAC'
    >>> assert mode(data, first_tie=True) == 'A'

With the current API, there is no reasonable way to get to 'A' from 'ABBAC'.
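
To make the nearest-neighbor use case concrete, here is a minimal sketch of a majority vote among the k closest points.  The knn_predict() helper, the 1-D training data, and the distance measure are all made up for illustration; only Counter.most_common() is real API.  Because the labels are counted nearest-first, a tie goes to the label of the closest neighbor, which is the "first tie" behavior proposed above.

```python
from collections import Counter

def knn_predict(query, training, k=3):
    """Classify `query` by majority vote among its k nearest 1-D points.

    `training` is a list of (value, label) pairs.  Hypothetical helper,
    not part of the statistics module.
    """
    # Sort by distance so the nearest neighbor's label is counted first.
    nearest = sorted(training, key=lambda pair: abs(pair[0] - query))[:k]
    labels = [label for _, label in nearest]
    # most_common(1) keeps the first-encountered label among tied counts.
    return Counter(labels).most_common(1)[0][0]

training = [(1.0, 'A'), (2.0, 'B'), (3.0, 'B'), (4.0, 'A'), (9.0, 'C')]

# The four nearest labels are B, B, A, A: a 2-2 tie that goes to 'B',
# the label seen first (and therefore the closest).
print(knn_predict(2.4, training, k=4))
```

A mode() that raises on the 2-2 tie would leave this classifier with no answer at all, even though either label is an acceptable vote.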

Also, the new code path is much faster than the existing code path because it extracts only the single most common item using max(), rather than ranking all n items, which has to sort the whole items() list.  New path: O(n).  Existing path: O(n log n).
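
The difference between the two paths can be sketched directly on a Counter.  This is an illustration of the idea rather than the actual statistics module code; as I understand the CPython implementation, most_common(1) ends up in a max() scan while most_common() with no argument sorts, and the snippet below spells out both by hand.

```python
from collections import Counter
from operator import itemgetter

data = 'ABBAC'
counts = Counter(data)  # one O(n) pass; keys keep first-seen order

# Linear scan over the distinct items.  max() keeps the first maximal
# item it sees, so a tie goes to the value encountered first.
first_mode, first_count = max(counts.items(), key=itemgetter(1))

# Full ranking: sorts every distinct item, O(m log m) for m distinct values.
ranked = sorted(counts.items(), key=itemgetter(1), reverse=True)

print(first_mode, first_count)
print(ranked)
```

Both results agree with what Counter reports: the hand-written scan matches most_common(1), and the hand-written sort matches most_common().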

Note that the current API is somewhat awkward to use.  In general, a user can't know in advance whether the data contains only a single mode.  Accordingly, every call to mode() has to be wrapped in a try/except.  And if the user just wants one of those modal values, there is no way to get to it.  See https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mode.html for comparison.
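
Here is roughly what that wrapping looks like in practice.  strict_mode() is a stand-in I wrote to mimic the two error outcomes described above (it is not the statistics module source), and mode_or_default() is a hypothetical caller-side helper; only Counter and StatisticsError are real names.

```python
from collections import Counter
from statistics import StatisticsError

def strict_mode(data):
    """Stand-in for the behavior described above: raise on empty or
    multimodal input, otherwise return the single mode."""
    ranked = Counter(data).most_common()
    if not ranked:
        raise StatisticsError('no mode for empty data')
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        raise StatisticsError('no unique mode')
    return ranked[0][0]

def mode_or_default(data, default=None):
    """What a caller ends up writing when ties cannot be ruled out."""
    try:
        return strict_mode(data)
    except StatisticsError:
        return default

print(mode_or_default('ABBBC'))  # unique mode
print(mode_or_default('ABBAC'))  # tie: falls back to the default
```

Note that the except branch has nothing to work with: the exception says a tie happened but carries none of the tied values, so the caller has to recount the data to pick one.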

There may be better names for the flag.  "tie_goes_to_first_encountered" seemed a bit long though ;-)

History
Date	User	Action	Args
2019-02-03 18:51:29	rhettinger	set	recipients: + rhettinger, steven.daprano
2019-02-03 18:51:25	rhettinger	set	messageid: <1549219885.57.0.86866159357.issue35892@roundup.psfhosted.org>
2019-02-03 18:51:25	rhettinger	link	issue35892 messages
2019-02-03 18:51:25	rhettinger	create