classification
Title: SequenceMatcher bug with long sequences
Type: Stage:
Components: Library (Lib) Versions: Python 3.0, Python 2.5
process
Status: closed Resolution: duplicate
Dependencies: Superseder: difflib.SequenceMatcher not matching long sequences
View: 2986
Assigned To: Nosy List: LambertDW, eli.bendersky, ggenellina, terry.reedy
Priority: normal Keywords:

Created on 2008-12-10 17:20 by eli.bendersky, last changed 2010-06-25 21:55 by terry.reedy. This issue is now closed.

Messages (5)
msg77559 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2008-12-10 17:20
Here's a reproduction of the error:

Python 2.5.2 (r252:60911, Oct 20 2008, 09:11:31)
[GCC 3.4.6 20060404 (Red Hat 3.4.6-10)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import difflib
>>>
>>> difflib.SequenceMatcher(None, [4] + [5] * 200, [5] * 200).ratio()
0.0

ratio() should be returning close to 1.0 here, not 0. This is only a
problem for sequences longer than 200. The analogous run for 100:

>>> difflib.SequenceMatcher(None, [4] + [5] * 100, [5] * 100).ratio()
0.99502487562189057
>>>


I've managed to reproduce it on Linux, Windows (AS 2.5.2) and Try Python
(http://try-python.mired.org/)
msg77561 - (view) Author: David W. Lambert (LambertDW) Date: 2008-12-10 18:02
Python 3.0rc1+ similar.
msg77565 - (view) Author: Gabriel Genellina (ggenellina) Date: 2008-12-10 19:38
Python 2.3.4 and later have this bug. But release 2.1.3 doesn't:

Python 2.1.3 (#35, Apr  8 2002, 17:47:50) [MSC 32 bit (Intel)] on win32
Type "copyright", "credits" or "license" for more information.
>>> import difflib
>>> difflib.SequenceMatcher(None, [4] + [5] * 500, [5] * 500).ratio()
0.99900099900099903
>>> difflib.SequenceMatcher(None, [4] + [5] * 200, [5] * 200).ratio()
0.99750623441396513
>>> difflib.SequenceMatcher(None, [4] + [5] * 100, [5] * 100).ratio()
0.99502487562189057

I don't have any 2.2 release to test right now.
msg77566 - (view) Author: Gabriel Genellina (ggenellina) Date: 2008-12-10 19:42
#2986 may be a duplicate of this; #1528074 is relevant too.
msg108635 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-06-25 21:55
This appears to be one of at least three duplicate issues: #1528074, #2986, and #4622. I am closing two, leaving 2986 open, and merging the nearly disjoint nosy lists. (If no longer interested, you can delete yourself from 2986.) #1711800 appears to be slightly different (if not, it could be closed also.)

Whether or not a new feature is ever added (earliest, now, 3.2), it appears that the docs need improvement to at least explain the current behavior. If someone who understands the issue could open a separate doc issue (for 2.6/7/3.1/2) with a suggested addition, that would be great.
History
Date User Action Args
2010-06-25 21:55:55terry.reedysetstatus: open -> closed

nosy: + terry.reedy
messages: + msg108635

superseder: difflib.SequenceMatcher not matching long sequences
resolution: duplicate
2008-12-10 19:42:21ggenellinasetmessages: + msg77566
2008-12-10 19:38:27ggenellinasetnosy: + ggenellina
messages: + msg77565
2008-12-10 18:02:57LambertDWsetnosy: + LambertDW
messages: + msg77561
versions: + Python 3.0
2008-12-10 17:20:54eli.benderskycreate