Author David Su2
Recipients David Su2, rhettinger
Date 2016-07-23.18:05:28
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1469297129.91.0.759319864373.issue27575@psf.upfronthosting.co.za>
In-reply-to
Content
I am trying to implement the following algorithm:

http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/

This algorithm is used for fast nearest neighbor queries for spell correction. 

Given a corpus of strings, for each string s, I generate all subsequences (Note: subsequence, not substring) of that string length (len(n) - k), call that subsequence t, where k is a tunable parameter. Then, I create a dict mapping the subsequences t to the original string s.

This process creates a very large dict from a relatively small corpus. For example, my corpus of ~500k strings blew up to a dict of ~10M keys, with k=3.

Then, for a query string q, you generate all subsequences of (len(q) - k), and perform an intersection with the keys of the dict. Then, the values of the dict corresponding to the key intersection will be possible misspelling of the query string q.

In pseudo-python:

def get_subseq(s: str, k: int):
    returns all subsequences of s of length len(s) - k

def build_dict(corpus: List[str], k: int):
    d = defaultdict(list)

    for orig_str in corpus:
        for subseq in get_subseq(orig_str, k):
            d[subseq].append(orig_str)

    return d

def query(d: dict, q: str, k:int):
    q_subseq = set(get_subseq(q, k))
    intersection = d.viewkeys() & q_subseq # this is slow when len(d) is large
    return [orig_str for k in intersection for orig_str in d[k]]
    

By exposing an intersection interface, a certain degree of performance is expected by the user. It took me a considerable amount of debugging to realize that the dict.viewkeys() intersection was the performance bottleneck. I finally decided to dig into the CPython source code before discovering the root cause of the issue. While I am familiar with C and found the issue relatively quickly, for those that aren't, it should not be necessary to dig into the source code for a mainstream programming language like Python.


I am currently looking at how to potentially fix this in the CPython repository. I may submit a patch in the near future if I find a good solution.
History
Date User Action Args
2016-07-23 18:05:29David Su2setrecipients: + David Su2, rhettinger
2016-07-23 18:05:29David Su2setmessageid: <1469297129.91.0.759319864373.issue27575@psf.upfronthosting.co.za>
2016-07-23 18:05:29David Su2linkissue27575 messages
2016-07-23 18:05:28David Su2create