classification
Title: save scores or ratios in difflib get_close_matches
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 2.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: Claudiu.Popa, michaelohlrogge, russellballestrini, tim.peters, zach.ware
Priority: normal Keywords: patch

Created on 2014-04-24 14:42 by russellballestrini, last changed 2014-10-23 21:42 by michaelohlrogge. This issue is now closed.

Files
File name Uploaded Description Edit
diff-lib-get-scored-matches-tests-and-docs.patch russellballestrini, 2014-04-25 01:49 difflib get_scored_matches function docs tests
diff-lib-tim-peters-assert-list-equals.patch russellballestrini, 2014-04-25 17:54 Tim Peters suggested assertListEqual over assertEqual
Messages (17)
msg217123 - (view) Author: Russell Ballestrini (russellballestrini) * Date: 2014-04-24 14:41
The current implementation of difflib's get_close_matches() function computes computationally complex scores (ratios) but then tosses them out without giving the end-user the chance to have at them.

This patch adds an optional "scores" boolean argument that may be passed to alter the return output from a list of words, to a list of (score, word) tuples.
msg217124 - (view) Author: PCManticore (Claudiu.Popa) * (Python triager) Date: 2014-04-24 14:58
It would be easier to review your patch if you'll upload it as a proper patch.

Usually for these cases (modifying the return by passing a specific argument) it's best to provide a new function with this functionality, by having get_close_matches and get_scored_close_matches (for instance),
which returns the modified result.
msg217125 - (view) Author: PCManticore (Claudiu.Popa) * (Python triager) Date: 2014-04-24 14:59
Ah, nevermind my first comment.
msg217126 - (view) Author: Russell Ballestrini (russellballestrini) * Date: 2014-04-24 15:08
Claudiu.Popa,

Yes, that was my first idea on how to tackle this issue.

I will create another proper patch that prepares two separate functions:

* get_close_matches
* get_scored_close_matches

Where each are basically wrapper / API functions around a private function that holds the algorithm:

* _get_scored_close_matches
msg217127 - (view) Author: Russell Ballestrini (russellballestrini) * Date: 2014-04-24 15:53
New function in difflib: get_scored_matches()

This function acts just like the existing get_close_matches()
function however instead of returning a list of words, it 
returns a list of tuples (score, word) pairs.

This gives the end-user the ability to access the
computationally expensive scores/ratios produced as a by-product.

The new usage does _not_ impact backward compatibility::

  >>> import difflib
  >>> import keyword as _keyword
  >>> difflib.get_scored_matches("wheel", _keyword.kwlist)
  [(0.6, 'while')]
  >>> difflib.get_close_matches("wheel", _keyword.kwlist)
  ['while']

HG: Enter commit message.  Lines beginning with 'HG:' are removed.
HG: Leave message empty to abort commit.
HG: --
HG: user: RussellBallestrini
HG: branch 'default'
changed Lib/difflib.py
msg217130 - (view) Author: PCManticore (Claudiu.Popa) * (Python triager) Date: 2014-04-24 16:04
Your patch needs tests and documentation update. For examples, you could look in test_difflib.py and see how get_close_matches is tested.
msg217131 - (view) Author: Russell Ballestrini (russellballestrini) * Date: 2014-04-24 16:14
get_close_matches() doesn't seem to have any tests... I suppose I should write them considering I'm changing the functionality a bit.  

TODO: write tests for 

* difflib.get_close_matches()
* difflib.get_scored_matches()

Determine if docstrings are enough to document the new function.  (I thought it would be)
msg217132 - (view) Author: Zachary Ware (zach.ware) * (Python committer) Date: 2014-04-24 17:13
Russell Ballestrini wrote:
> Determine if docstrings are enough to document the new function.

No, Doc/library/difflib.rst will need an update.

(Btw, I removed 2.7 from versions because 2.7 is not open to new features, bugs and security-critical enhancements only.)
msg217145 - (view) Author: Russell Ballestrini (russellballestrini) * Date: 2014-04-25 01:49
Ok, this patch is ready for review.
msg217146 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-04-25 03:04
I wonder whether this new function would attract any users, given that the user already has control over the smallest ratio that will be accepted, and over the maximum number of close matches returned.  That's always been sufficient for me.

What useful thing(s) can the user do with the scores?  If there are compelling uses, in _those_ contexts are the `n` and `cutoff` arguments useful too?  Or would it, for example, be more useful to generate all (score, word) pairs and let the user filter them as they wish?  Without a concrete use case, there's no clear answer.

About existing tests for `get_close_matches()`, those are in the function's docstring.  doctest checks them.

About the new tests in the patch, note that comparing lists for equality "should be" done via AssertListEqual, not via AssertEqual.  Don't ask me why, but someone will eventually yell about it ;-)
msg217150 - (view) Author: Russell Ballestrini (russellballestrini) * Date: 2014-04-25 04:05
At some point I plan to write a web API that accepts a word, 'doge' and returns a list of possible suggestions and scores.  Later a "did you mean dog" style suggestion could be implemented on top.

We compute the scores, and it is computationally taxing, we shouldn't always throw this data away.  Most users will continue to use get_close_matches, some users might want to build indexes on the scores.  

Other users may want to cache (memonize) common queries for super fast look ups.  Additionally the new function will give end-users the opportunity to inspect the scoring algos output.

I prefer to use the same arg spec because it is already widely understood and documented.
msg217166 - (view) Author: Russell Ballestrini (russellballestrini) * Date: 2014-04-25 17:54
Adding patch to update tests to use Tim Peters suggestion of assertListEqual over assertEqual for list compares.
msg217171 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-04-25 21:04
Russell, I'm still looking for a sufficiently compelling "use case" here:  something tangible and useful that can be done with the new function that can't be easily done now.

"I plan to write a web API that accepts a word, 'doge' and returns a list of possible suggestions and scores" is not a use case for scores.  It's merely tautological that if you want to return scores then you need a function that does return scores.  A "use case" would more address _why_ the scores are useful.  What would the user of your web API _do_ with the scores?  What's the point?

"users may want to cache (memonize) common queries for super fast look ups" isn't a use case for scores either.  If they wanted to, they could already cache the results of calling `get_close_matches()` - the results of any function can be cached; exposing scores has nothing to do with whether results can be cached.

"the new function will give end-users the opportunity to inspect the scoring algos output" is also more tautological than a use case.  _Why_ would a user want to stare at the scores?  What useful things(s) could they do with them?

I was added to this issue because I wrote these functions to begin with.  At the time, I thought - and asked - about exposing the scores, but nobody (including me) had a _use_ for doing so that justified the added bother of writing & maintaining the additional code and tests and docs.

I'm stilling looking for a use here more substantial than, essentially, just saying "well, without showing the scores we can't show the scores". To me the scores just aren't interesting beyond which words' scores exceed a cutoff, and the ordering of words based on their similarity scores - but `get_close_matches()` already captures those uses.  What other use(s) _do_ you have for the scores?  I'm afraid "just to display them" isn't compelling enough - you're the only one ever to ask for that, and you already know how to do it yourself ;-)
msg217174 - (view) Author: Russell Ballestrini (russellballestrini) * Date: 2014-04-25 22:18
Tim,

You bring up some great points and insight I was missing.

"To me the scores just aren't interesting beyond which words' scores exceed a cutoff, and the ordering of words based on their similarity scores - but `get_close_matches()` already captures those uses."

For a *word*, and a corpus of *possibilities*, how does one choose a satisfactory *cutoff* without inspecting the output of the scoring algorithm?

Personally, I don't want to inpect scores for inspection sake, I want to inspect scores so I can make an informed decision for the *n* and *cutoff* input arguments.

Its true that after reading and digesting the source code for `get_close_matches()` I could (and did) implement a version that returns scores.  My goal was to share this code and what better way then to "fix" the problem upstream.

I understand the desire to keep the standard library lean and useful to reduce the amount of burden the code is to maintain.  I will understand if we decide not to include these patches, I can always maintain a fork and share on pypi.
msg219710 - (view) Author: Zachary Ware (zach.ware) * (Python committer) Date: 2014-06-03 18:38
Absent Tim's satisfaction regarding a use-case, I'm closing the issue.  Russell, if you do package this up for pypi and it does become popular (for some definition of 'popular' :), feel free to reopen this issue or open a new one.
msg229896 - (view) Author: Michael Ohlrogge (michaelohlrogge) Date: 2014-10-23 19:49
This is my first time posting here, so apologies if I'm breaking rules.

I'd like to put in a vote in favor of this patch to get the matching scores.

I am a researcher at Stanford University using this tool to match up about 100,000 different names of companies/entities in two different datasets that I have.  The names reflect the same underlying entities but because they come from different datasets, the spellings, abbreviations, etc. differ.

It would be helpful to me to be able to run the get_scored_close_matches() function and then sort the results by how close the matches were.  If I could for instance determine, based on some spot checking / sampling of the results, that everything with a match above a certain threshold is almost certainly correct, whereas those below a certain threshold need to be reviewed by hand, that would be helpful for me.  

I suppose I can accomplish something similar by playing around with setting the matching threshold at different levels.  Nevertheless, with as many possible matches as I am doing, the algorithm takes a decent amount of time to run, and I don't have a good way to know ex-ante what a reasonable threshold would be.

Just in general, I think it can be useful information for users to know how much confidence to have in the matches produced by the algorithm.  Users could choose to formulate this confidence either as a direct function of the score or perhaps based on some other factors, such as a statistical analysis procedure that takes the score into account.  

Thanks to everyone who put this package together and who suggested the patch.
msg229907 - (view) Author: Michael Ohlrogge (michaelohlrogge) Date: 2014-10-23 21:42
Another way the scores could be useful would be to write an algorithm that would give you a number of possible answers based on the scores that you get.  In other words, for example, perhaps if one of the possible matches has a score about .9, then it would only give you one, but if all were below .8, it would give you several.  Or, if the highest score were at least .1 greater than the next highest, it would only give you one, but if there were a bunch that were close together, it would return those.  

I'm not saying these specific applications should be part of the package, they are just more examples of how you could productively use the scores.
History
Date User Action Args
2014-10-23 21:42:05michaelohlroggesetmessages: + msg229907
2014-10-23 19:49:51michaelohlroggesetnosy: + michaelohlrogge

messages: + msg229896
versions: + Python 2.7, - Python 3.5
2014-06-03 18:38:32zach.waresetstatus: open -> closed
resolution: rejected
messages: + msg219710

stage: patch review -> resolved
2014-04-25 22:18:28russellballestrinisetmessages: + msg217174
2014-04-25 21:04:50tim.peterssetmessages: + msg217171
2014-04-25 17:54:14russellballestrinisetfiles: + diff-lib-tim-peters-assert-list-equals.patch

messages: + msg217166
2014-04-25 04:05:34russellballestrinisetmessages: + msg217150
2014-04-25 03:04:14tim.peterssetmessages: + msg217146
2014-04-25 01:49:39russellballestrinisetfiles: + diff-lib-get-scored-matches-tests-and-docs.patch

messages: + msg217145
2014-04-25 01:47:32russellballestrinisetfiles: - difflib-patch-to-save-scores.patch
2014-04-24 20:54:05russellballestrinisetfiles: - difflib-patch-to-save-scores-in-get-close-matches.patch
2014-04-24 19:48:08rhettingersetnosy: + tim.peters
2014-04-24 17:13:38zach.waresetnosy: + zach.ware
messages: + msg217132
2014-04-24 16:14:04russellballestrinisetmessages: + msg217131
2014-04-24 16:12:43zach.waresetstage: patch review
versions: - Python 2.7
2014-04-24 16:04:03Claudiu.Popasetmessages: + msg217130
2014-04-24 15:53:00russellballestrinisetfiles: + difflib-patch-to-save-scores.patch

messages: + msg217127
2014-04-24 15:08:08russellballestrinisetmessages: + msg217126
2014-04-24 14:59:59Claudiu.Popasetmessages: + msg217125
2014-04-24 14:58:45Claudiu.Popasetnosy: + Claudiu.Popa
messages: + msg217124
2014-04-24 14:53:57russellballestrinisetfiles: + difflib-patch-to-save-scores-in-get-close-matches.patch
keywords: + patch
2014-04-24 14:52:53russellballestrinisetfiles: - difflib.py
2014-04-24 14:42:10russellballestrinicreate