Message 156062 - Python tracker


Author	terry.reedy
Recipients	docs@python, eli.bendersky, eric.araujo, patena, terry.reedy, tim.peters
Date	2012-03-16.18:01:24
Message-id	<1331920885.34.0.290502995381.issue14332@psf.upfronthosting.co.za>
In-reply-to

Content

I reproduced the observed behavior in 3.3.0a.
However, I am rather sure it is not a bug.
In any case, linejunk is not ignored. Passing 'lambda x: 1/0' causes ZeroDivisionError, proving that it gets called.
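A minimal sketch of that check, assuming two small made-up input lists (the names a, b and exploding_junk are illustrative only, not from the report). Because ndiff returns a generator, the exception only surfaces once the result is consumed:

```python
from difflib import ndiff

# Hypothetical inputs, chosen only to give ndiff something to compare.
a = ["one\n", "two\n"]
b = ["one\n", "three\n"]


def exploding_junk(line):
    # Probe function: any call at all raises ZeroDivisionError.
    return 1 / 0


def linejunk_was_called():
    try:
        # Consuming the generator forces Differ.compare to run.
        list(ndiff(a, b, linejunk=exploding_junk))
    except ZeroDivisionError:
        return True
    return False


print(linejunk_was_called())
```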

The body of ndiff(a, b, linejunk, charjunk) is
    return Differ(linejunk, charjunk).compare(a, b)
Differ uses the linejunk parameter only here:
    cruncher = SequenceMatcher(self.linejunk, a, b)

SequenceMatcher uses the first parameter, isjunk, in the internal .__chain_b method to segregate (not remove) items expected to be common in order to speed up the .find_longest_match method. Read the docstring for that method (and possibly the code) to see how it affects matching. The main intent of the *junk parameters is to speed up matching to find differences, not to mask differences. They do, however, affect the output of the .*ratio methods.
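As a rough illustration, the strings below are the ones used in the find_longest_match documentation. The helper is_space and the variable names are my own. The sketch compares matching with and without a junk predicate: the junk element is set aside (it shows up in the bjunk attribute), the longest match changes, and so does ratio().

```python
from difflib import SequenceMatcher

a, b = " abcd", "abcd abcd"


def is_space(ch):
    # Illustrative junk predicate: treat a blank as junk.
    return ch == " "


plain = SequenceMatcher(None, a, b)
junked = SequenceMatcher(is_space, a, b)

m_plain = plain.find_longest_match(0, len(a), 0, len(b))
m_junked = junked.find_longest_match(0, len(a), 0, len(b))

print(m_plain)    # Match(a=0, b=4, size=5): the blank is part of the match
print(m_junked)   # Match(a=1, b=0, size=4): the match cannot start at the blank
print(junked.bjunk)                    # junk elements of b, kept aside
print(plain.ratio(), junked.ratio())   # the two ratios differ
```

Neither call hides the blank from the inputs; only the choice of matching block changes.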

The doc string for ndiff says "The default is None, and is recommended; as of Python 2.3, an adaptive notion of "noise" lines is used that does a good job on its own." That is a good idea.

That said, I think the doc (and docstrings) should explain the notion of "junk" elements and what 'ignoring' them means. In particular, I think a couple of sentences should be added after "The idea is to find the longest contiguous matching subsequence that contains no “junk” elements (the Ratcliff and Obershelp algorithm doesn’t address junk)." The quotes around "junk" indicate that it is being used with a non-standard, module-specific meaning. What is it? And what does 'ignore' (used several times later in the doc) mean?

Tim, I think we may need your help here since 'junk' is your label for your concept and I am not sure I understand well enough to articulate it. (For one thing, given that the "common" heuristic was apparently meant to replace at least the linejunk version of junk, I do not understand why .find_longest_match treats 'junk' and 'common' items differently, other than that the two concepts are apparently not the same.)
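To make the junk/common distinction concrete, here is a small sketch, assuming Python 3.2 or later, where SequenceMatcher exposes the bjunk and bpopular attributes. The input data and the is_blank helper are invented for illustration. It shows only that the two kinds of element are tracked in separate sets, not why find_longest_match handles them differently.

```python
from difflib import SequenceMatcher

# Hypothetical data: 300 distinct lines plus 10 blank lines. The "common"
# heuristic only applies when the second sequence has at least 200 items.
a = ["unrelated\n"]
b = [f"line {i}\n" for i in range(300)] + ["\n"] * 10


def is_blank(line):
    # Illustrative junk predicate: blank lines are junk.
    return line == "\n"


auto = SequenceMatcher(None, a, b)          # no junk function, heuristic only
explicit = SequenceMatcher(is_blank, a, b)  # blank lines declared junk

print(auto.bjunk, auto.bpopular)          # blank line is "popular", not junk
print(explicit.bjunk, explicit.bpopular)  # blank line is junk, not "popular"
```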
