Message 363618 - Python tracker

Author	brian.gallagher
Recipients	brian.gallagher
Date	2020-03-07.23:27:55
Message-id	<1583623675.84.0.678348320324.issue39891@roundup.psfhosted.org>

Content
Currently difflib's get_close_matches() does not treat words that differ only in letter case as close matches.

Example:
user@host:~$ python3
Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import difflib
>>> difflib.get_close_matches("apple", ["APPLE"])
[]
>>> difflib.get_close_matches("apple", ["APpLe"])
[]
>>>

These seem like they should be considered close matches for each other, given that the SequenceMatcher used in difflib.py aims to produce a "human-friendly diff" of two words in order to yield "intuitive difference reports".
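
To illustrate why nothing is returned, here is a minimal sketch that looks at the similarity score directly. It assumes the default cutoff of 0.6 used by get_close_matches(); the helper name `similarity` is hypothetical and only there for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return the SequenceMatcher ratio for two strings (case-sensitive)."""
    return SequenceMatcher(None, a, b).ratio()


# No characters are shared when every letter differs in case.
print(similarity("apple", "APPLE"))  # 0.0

# Only the lower-case "p" and "e" line up: 2 * 2 / (5 + 5).
print(similarity("apple", "APpLe"))  # 0.4

# After normalising case the strings are identical.
print(similarity("apple".lower(), "APPLE".lower()))  # 1.0
```

Both mixed-case scores fall below 0.6, which is why get_close_matches() returns an empty list for them.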

One solution would be for the user of the function to transform the supplied data themselves, for example by converting all strings to lower case. However, this limitation might come as a surprise to a user who was not aware of it. In my view, it would be preferable to provide this functionality by default.
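
As a rough sketch of the user-side workaround described above (not a proposed patch to difflib), a wrapper could normalise case before matching and then map the results back to their original spellings. The name `get_close_matches_icase` is hypothetical, and the use of str.casefold() rather than str.lower() is my own choice.

```python
import difflib


def get_close_matches_icase(word, possibilities, n=3, cutoff=0.6):
    """Case-insensitive variant of difflib.get_close_matches() (illustrative only)."""
    # Group the original spellings under their case-folded form.
    folded = {}
    for candidate in possibilities:
        folded.setdefault(candidate.casefold(), []).append(candidate)

    # Match on the case-folded strings using the standard function.
    matches = difflib.get_close_matches(
        word.casefold(), list(folded), n=n, cutoff=cutoff
    )

    # Expand each folded match back into the original spellings.
    result = []
    for match in matches:
        result.extend(folded[match])
    return result[:n]


print(get_close_matches_icase("apple", ["APPLE", "APpLe", "banana"]))
# ['APPLE', 'APpLe']
```

Grouping by the folded form keeps the caller's original strings in the output, at the cost of treating spellings that differ only in case as equally good matches.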

If this is an issue the relevant maintainer(s) consider worth pursuing, I'd love to try my hand at preparing a patch for this.

History
Date	User	Action	Args
2020-03-07 23:27:55	brian.gallagher	set	recipients: + brian.gallagher
2020-03-07 23:27:55	brian.gallagher	set	messageid: <1583623675.84.0.678348320324.issue39891@roundup.psfhosted.org>
2020-03-07 23:27:55	brian.gallagher	link	issue39891 messages
2020-03-07 23:27:55	brian.gallagher	create