This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author Dennis Sweeney
Recipients BTaskaya, Dennis Sweeney, aroberge, james, pablogsal, serhiy.storchaka, vstinner, xtreak, yselivanov
Date 2021-04-30.20:51:08
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1619815868.6.0.928194946461.issue38530@roundup.psfhosted.org>
In-reply-to
Content
Some research of other projects:


LLVM [1][2]
-----------

- Compute Levenshtein
    - Using O(n) memory rather than O(n^2)
- Uses UpperBound = (len(typo) + 2) // 3


GCC [3]
-------

- Uses Damerau-Levenshtein distance
    - Counts transpositions like "abcd" <-> "bacd" as one move
- Swapping Case as in "a" <-> "A" counts as half a move
- cutoff = (longer + 2) // 3 if longer - shorter >= 2 else max(longer // 3, 1)


Rust [4]
--------

- "maximum allowable edit distance defaults to one-third of the given word."
- First checks for exact case-insensitive match, then check for Levenshtein distance small enough, then check if sorted(a.split("_")) == sorted(b.split("_"))


Ruby [5]
--------
- Quickly filter out words with bad Jaro–Winkler distance
    - threshold = input.length > 3 ? 0.834 : 0.77
- Only compute Levenshtein for words that remain
    - threshold = (input.length * 0.25).ceil
    - Output all good enough words
- If no word was good enough then output the closest match.


I think there are some good ideas here.


[1] https://github.com/llvm/llvm-project/blob/d480f968ad8b56d3ee4a6b6df5532d485b0ad01e/llvm/include/llvm/ADT/edit_distance.h#L42
[2] https://github.com/llvm/llvm-project/blob/e2b3b89bf1ce74bf889897e0353a3e3fa93e4452/clang/lib/Sema/SemaLookup.cpp#L4263
[3] https://github.com/gcc-mirror/gcc/blob/16e2427f50c208dfe07d07f18009969502c25dc8/gcc/spellcheck.c
[4] https://github.com/rust-lang/rust/blob/673d0db5e393e9c64897005b470bfeb6d5aec61b/compiler/rustc_span/src/lev_distance.rs#L44
[5] https://github.com/ruby/ruby/blob/48b94b791997881929c739c64f95ac30f3fd0bb9/lib/did_you_mean/spell_checker.rb
History
Date User Action Args
2021-04-30 20:51:08Dennis Sweeneysetrecipients: + Dennis Sweeney, vstinner, aroberge, serhiy.storchaka, yselivanov, james, pablogsal, xtreak, BTaskaya
2021-04-30 20:51:08Dennis Sweeneysetmessageid: <1619815868.6.0.928194946461.issue38530@roundup.psfhosted.org>
2021-04-30 20:51:08Dennis Sweeneylinkissue38530 messages
2021-04-30 20:51:08Dennis Sweeneycreate