Issue 46667: SequenceMatcher & autojunk - false negative

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/90825

classification

Title:	SequenceMatcher & autojunk - false negative
Type:	behavior	Stage:
Components:	Library (Lib)	Versions:	Python 3.8

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	jonathan-lp, tim.peters
Priority:	normal	Keywords:

Created on 2022-02-06 21:58 by jonathan-lp, last changed 2022-04-11 14:59 by admin.

Messages (6)
msg412673 - (view)	Author: Jonathan (jonathan-lp)	Date: 2022-02-06 21:58
The following two strings are identical other than the text "UNIQUESTRING". UNIQUESTRING is at the start of first and at the end of second. Running the below gives the following output: 0.99830220713073 0.99830220713073 0.023769100169779286 # ratio 0.99830220713073 0.99830220713073 0.023769100169779286 # ratio As you can see, Ratio is basically 0. Remove either of the UNIQUESTRING pieces and it goes up to 0.98 (correct)... Remove both and you get 1.0 (correct) ``` from difflib import SequenceMatcher first = """ UNIQUESTRING Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum """ second = """ Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum UNIQUESTRING """ sm = SequenceMatcher(None, first, second, autojunk=False) print(sm.real_quick_ratio()) print(sm.quick_ratio()) print(sm.ratio()) print() sm2 = SequenceMatcher(None, second, first, autojunk=False) print(sm2.real_quick_ratio()) print(sm2.quick_ratio()) print(sm2.ratio()) ``` If I add `autojunk=False`, then I get a correct looking ratio (0.98...), however from my reading of the autojunk docs, UNIQUESTRING shouldn't be triggering it. Furthermore, looking in the code, as far as I can see autojunk is having no effect... Autojunk considers these items to be "popular" in that string: `{'n', 'p', 'a', 'h', 'e', 'u', 'I', 'r', 'k', 'g', 'y', 'm', 'c', 'd', 't', 'l', 'o', 's', ' ', 'i'}` If I remove UNIQUESTRING from `first`, this is the autojunk popular set: `{'c', 'p', 'a', 'u', 'r', 'm', 'k', 'g', 'I', 'd', ' ', 'o', 'h', 't', 'e', 'i', 'l', 's', 'y', 'n'}` They're identical! In both scenarios, `b2j` is also identical. I don't pretend to understand what the module is doing in any detail, but this certainly seems like a false positive/negative. Python 3.8.10
msg412675 - (view)	Author: Jonathan (jonathan-lp)	Date: 2022-02-06 22:06
(Like the idiot I am, the example code is wrong. `autojunk` parameter should not be set for either of them to get the stated wrong results). In place of "UNIQUESTRING", any unique 3 character string triggers it (QQQ, EEE, ZQU...). And in those cases you get a ratio of 0.008! (and 0.993 in the other direction!)
msg412676 - (view)	Author: Jonathan (jonathan-lp)	Date: 2022-02-06 22:08
Gah. I mean 0.008 in both directions. I'm just going to be quiet now. :-)
msg412677 - (view)	Author: Tim Peters (tim.peters) *	Date: 2022-02-06 22:42
SequenceMatcher looks for the longest _contiguous_ match. "UNIQUESTRING" isn't the longest by far when autojunk is False, but is the longest when autojunk is True. All those bpopular characters then effectively prevent finding a longer match than 'QUESTR' (capital 'I" is also in bpopular) directly. The effects of autojunk can be surprising, and it would have been better if it were False by default. But I don't see anything unexpected here. Learn from experience and force it to False yourself ;-) BTW, it was introduced as a way to greatly speed comparing files of code, viewing them as sequences of lines. In that context, autojunk is rarely surprising and usually helpful. But it more often backfires when comparing strings (viewed as sequences of characters) :-(
msg412752 - (view)	Author: Jonathan (jonathan-lp)	Date: 2022-02-07 15:01
I still don't get how UNIQUESTRING is the longest even with autojunk=True, but that's an implementation detail and I'll trust you that it's working as expected. Given this, I'd suggest the following then: * `Autojunk=False` should be the default unless there's some reason to believe SequenceMatcher is mostly used for code comparisons. * If - for whatever reason - the default can't be changed, I'd suggest a nice big docs "Warning" (at a minimum a "Note") saying something like "The default autojunk=True is not suitable for normal string comparison. See autojunk for more information". * Human-friendly doc explanation for autojunk. The current explanation is only going to be helpful to the tiny fraction of users who understand the algorithm. Your explanation is a good start: "Autojunk was introduced as a way to greatly speed comparing files of code, viewing them as sequences of lines. But it more often backfires when comparing strings (viewed as sequences of characters)" Put simply: The current docs aren't helpful to users who don't have text matching expertise, nor do they emphasise the huge caveat that autojunk=True raises.
msg412779 - (view)	Author: Tim Peters (tim.peters) *	Date: 2022-02-07 18:14
We can't change defaults without superb reason - Python has millions of users, and changing the output of code "that works" is almost always a non-starter. Improvements to the docs are welcome. In your example, try running this code after using autojunk=True: pending = "" for ch in first: if ch in sm.bpopular: if pending: print(repr(pending)) pending = "" else: pending += ch print(repr(pending)) That shows how `first` is effectively broken into tiny pieces given that the "popular" chaaracters act like walls. Here's the start of the output: '\nUN' 'QUESTR' 'NG\nL' 'x' 'f' '.' 'L' 'b' "'" 'x' 'v' '1500' ',' and on & on. `QUESTER' is the longest common contiguous substring remaining.

History
Date	User	Action	Args
2022-04-11 14:59:55	admin	set	github: 90825
2022-02-07 18:14:16	tim.peters	set	messages: + msg412779
2022-02-07 15:01:52	jonathan-lp	set	messages: + msg412752
2022-02-06 22:42:41	tim.peters	set	nosy: + tim.peters messages: + msg412677
2022-02-06 22:08:12	jonathan-lp	set	messages: + msg412676
2022-02-06 22:06:38	jonathan-lp	set	messages: + msg412675
2022-02-06 21:58:45	jonathan-lp	create