Author michaelohlrogge
Recipients Claudiu.Popa, michaelohlrogge, russellballestrini, tim.peters, zach.ware
Date 2014-10-23.19:49:50
Message-id <1414093791.51.0.809826247969.issue21344@psf.upfronthosting.co.za>
In-reply-to
Content
This is my first time posting here, so apologies if I'm breaking rules.

I'd like to put in a vote in favor of this patch, which makes the matching scores available.

I am a researcher at Stanford University using this tool to match up about 100,000 different names of companies/entities in two different datasets that I have.  The names reflect the same underlying entities but because they come from different datasets, the spellings, abbreviations, etc. differ.

It would be helpful to me to be able to run the get_scored_close_matches() function and then sort the results by how close the matches were. If I could determine, for instance by spot-checking a sample of the results, that everything with a score above a certain threshold is almost certainly correct, whereas matches below a certain threshold need to be reviewed by hand, that would be very useful to me.
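To make the use case concrete, here is a minimal sketch of the kind of helper I have in mind. It is not the patch itself: scored_close_matches is a hypothetical name, and the body simply applies the same SequenceMatcher ratio checks that difflib.get_close_matches uses while keeping each score next to its candidate.

```python
from difflib import SequenceMatcher
from heapq import nlargest


def scored_close_matches(word, possibilities, n=3, cutoff=0.6):
    """Return up to n (score, candidate) pairs, best score first.

    Hypothetical helper, not part of difflib. It applies the same
    ratio checks as difflib.get_close_matches but keeps the score.
    """
    if n <= 0:
        raise ValueError("n must be > 0")
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be in [0.0, 1.0]")

    matcher = SequenceMatcher()
    matcher.set_seq2(word)  # seq2 is cached, so set the fixed string here
    scored = []
    for candidate in possibilities:
        matcher.set_seq1(candidate)
        # The two cheap upper bounds are checked first so the expensive
        # ratio() is only computed for plausible candidates.
        if (matcher.real_quick_ratio() >= cutoff
                and matcher.quick_ratio() >= cutoff):
            score = matcher.ratio()
            if score >= cutoff:
                scored.append((score, candidate))
    # Tuples compare by score first, so nlargest sorts best-first.
    return nlargest(n, scored)


if __name__ == "__main__":
    for score, name in scored_close_matches(
            "appel", ["ape", "apple", "peach", "puppy"]):
        print(f"{score:.2f}  {name}")
```

Because every result carries its score, the whole list can be sorted or filtered afterwards without calling the matcher again.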

I suppose I can accomplish something similar by rerunning with the matching threshold set at different levels. Nevertheless, with as many possible matches as I am processing, the algorithm takes a considerable amount of time to run, and I don't have a good way to know ex ante what a reasonable threshold would be.
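With scores available, the threshold could instead be chosen after a single run. A rough sketch of what I mean follows; the triage helper and the sample rows are made up purely for illustration and are not real results from my data.

```python
def triage(scored_matches, threshold):
    """Split (score, name, match) tuples into accepted and needs-review lists.

    Hypothetical helper: rows scoring at or above the threshold are
    accepted, everything else is queued for manual review.
    """
    accepted = [row for row in scored_matches if row[0] >= threshold]
    needs_review = [row for row in scored_matches if row[0] < threshold]
    return accepted, needs_review


if __name__ == "__main__":
    # Invented example rows: (score, name in dataset A, best match in dataset B).
    scored = [
        (0.95, "Acme Corp", "Acme Corp."),
        (0.82, "Globex Intl", "Globex International"),
        (0.64, "Initech", "Initrode"),
    ]
    # Trying several thresholds costs almost nothing, because the
    # expensive matching step has already been done once.
    for threshold in (0.6, 0.8, 0.9):
        accepted, needs_review = triage(scored, threshold)
        print(threshold, len(accepted), len(needs_review))
```

The expensive matching pass runs once at a permissive cutoff, and the threshold is then tuned against the spot-checked sample.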

More generally, I think it is useful for users to know how much confidence to place in the matches produced by the algorithm. Users could express this confidence either as a direct function of the score or through some other procedure, such as a statistical analysis that takes the score into account.

Thanks to everyone who put this package together and who suggested the patch.
History
Date                 User             Action  Args
2014-10-23 19:49:51  michaelohlrogge  set     recipients: + michaelohlrogge, tim.peters, Claudiu.Popa, zach.ware, russellballestrini
2014-10-23 19:49:51  michaelohlrogge  set     messageid: <1414093791.51.0.809826247969.issue21344@psf.upfronthosting.co.za>
2014-10-23 19:49:51  michaelohlrogge  link    issue21344 messages
2014-10-23 19:49:50  michaelohlrogge  create