Message 335252 - Python tracker


Author	tim.peters
Recipients	chris.jerdonek, jaraco, tim.peters, xtreak
Date	2019-02-11.18:46:15
Message-id	<1549910775.37.0.578936082231.issue35955@roundup.psfhosted.org>

Content
difflib generally synchs on the longest contiguous matching subsequence that doesn't contain a "junk" element.  By default, `ndiff()`'s optional `charjunk` argument considers blanks and tabs to be junk characters.

In the strings:

"drwxrwxr-x 2 2000  2000\n"
"drwxr-xr-x 2 2000  2000\n"

the longest matching substring not containing whitespace is "rwxr-x", of length 6, starting at index 4 in the first string and at index 1 in the second.  So it's aligning the strings like so:

"drwxrwxr-x 2 2000  2000\n"
   "drwxr-xr-x 2 2000  2000\n"
     123456

That's why it wants to delete the 1:4 slice ("rwx") in the first string and insert "r-x" after the longest matching substring.
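To check this by hand, here's a small sketch that drives `difflib.SequenceMatcher` directly with `IS_CHARACTER_JUNK`, the same predicate `ndiff()` uses for `charjunk` by default. The variable names are just for illustration.

```python
from difflib import IS_CHARACTER_JUNK, SequenceMatcher

a = "drwxrwxr-x 2 2000  2000\n"
b = "drwxr-xr-x 2 2000  2000\n"

# Same junk predicate ndiff() uses by default for charjunk (blanks and tabs).
sm = SequenceMatcher(IS_CHARACTER_JUNK, a, b)

# Longest junk-free match over the full range of both strings.
match = sm.find_longest_match(0, len(a), 0, len(b))
print(match)                                  # Match(a=4, b=1, size=6)
print(repr(a[match.a:match.a + match.size]))  # 'rwxr-x'

# Everything that isn't an "equal" opcode is what gets flagged in the diff.
edits = [(tag, a[i1:i2], b[j1:j2])
         for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]
print(edits)  # [('delete', 'rwx', ''), ('insert', '', 'r-x')]
```

The delete of "rwx" and the insert of "r-x" are exactly the two edits described above.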

The default is aimed at improving results for human-readable text, like prose and Python code, where stuff between whitespace is often read "as a whole" (words, keywords, identifiers, ...).

For cases like this one, where character-by-character differences are important, it's often better to pass `charjunk=None`.  Then the longest matching substring is "xr-x 2 2000  2000\n" at the tail end of both strings, and you get the output you're expecting.
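A minimal side-by-side sketch, assuming the two strings are passed to `ndiff()` as one-line lists (the names `old` and `new` are made up for the example):

```python
from difflib import ndiff

old = ["drwxrwxr-x 2 2000  2000\n"]
new = ["drwxr-xr-x 2 2000  2000\n"]

# Default: blanks and tabs are junk, so the sync point is "rwxr-x".
default_diff = list(ndiff(old, new))

# No junk characters: the sync point is the long common tail instead.
exact_diff = list(ndiff(old, new, charjunk=None))

print("".join(default_diff))
print("".join(exact_diff))
```

In the first diff, the "?" guide lines mark "---" under the old line and "+++" under the new one. In the second, each guide line carries a single "^" under the one character that actually changed.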
