This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: difflib reports incorrect location of mismatch
Type: behavior
Stage:
Components: Library (Lib)
Versions: Python 3.8, Python 3.7, Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: chris.jerdonek, jaraco, tim.peters, xtreak
Priority: normal
Keywords:

Created on 2019-02-10 14:11 by jaraco, last changed 2022-04-11 14:59 by admin.

Messages (13)
msg335154 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2019-02-10 14:11
In [this job](https://travis-ci.org/jaraco/cmdix/jobs/491246158), a project uses assertEqual to compare two directory listings that differ in the group permission bits. But the hint markers that should point to the mismatch point at positions that match:

E       AssertionError: '--w-[50 chars]drwxrwxr-x 2 2000  2000    4096 2019-02-10 14:[58 chars]oo\n' != '--w-[50 chars]drwxr-xr-x 2 2000  2000    4096 2019-02-10 14:[58 chars]oo\n'
E         --w-r---wx 1 2000  2000  999999 2019-02-10 14:02 bar
E       - drwxrwxr-x 2 2000  2000    4096 2019-02-10 14:02 biz
E       ?  ---
E       + drwxr-xr-x 2 2000  2000    4096 2019-02-10 14:02 biz
E       ?        +++
E       - -rw-rw-r-- 1 2000  2000     100 2019-02-10 14:02 foo
E       ? ---
E       + -rw-r--r-- 1 2000  2000     100 2019-02-10 14:02 foo
E       ?        +++

As you can see, it's the 'group' section of the flags that differs between the left and right sides of the comparison, but the hints point at the 'user' section on the left side and the 'world' section on the right side, even though those sections match.

I observed this on Python 3.7.1. I haven't delved deeper to see if the issue exists on 3.7.2 or 3.8.
msg335155 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2019-02-10 14:14
I should acknowledge that I'm using pytest here also... and pytest may be the engine that's performing the reporting of the failed assertion.

In fact, switching to simple assertions, I see the same behavior, so I now suspect the issue may lie with pytest and not unittest.
msg335156 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2019-02-10 14:24
I was able to replicate the issue using pytest and not unittest, so I've [reported the issue with that project](https://github.com/pytest-dev/pytest/issues/4765).
msg335158 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2019-02-10 14:58
Sorry to comment on a closed issue. I see the following behavior with difflib.ndiff, which unittest uses under the hood. Strings that differ by '-' and 'w' generate different output from strings that differ by 'a' and 'w'. I find the output for the '-' and 'w' case a little confusing. Is it caused by '-' also being used as a marker in difflib?

$ ./python.exe
Python 3.8.0a1+ (heads/master:8a03ff2ff4, Feb  9 2019, 10:42:29)
[Clang 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import difflib
>>> print(''.join(difflib.ndiff(["drwxrwxr-x 2 2000  2000\n"], ["drwxr-xr-x 2 2000  2000\n"])))
- drwxrwxr-x 2 2000  2000
?  ---
+ drwxr-xr-x 2 2000  2000
?        +++

>>> print(''.join(difflib.ndiff(["drwxrwxr-x 2 2000  2000\n"], ["drwxraxr-x 2 2000  2000\n"])))
- drwxrwxr-x 2 2000  2000
?      ^
+ drwxraxr-x 2 2000  2000
?      ^
msg335231 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2019-02-11 15:57
I'm re-opening this issue as it does seem to apply to the stdlib (difflib.ndiff), which is why I encountered it both in unittest and pytest. Thanks xtreak for the distilled example.
msg335235 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2019-02-11 16:12
I have tried other positions where only '-' and 'w' differ. They seem to produce a correct diff, except for this one case where the diff was confusing.
msg335246 - (view) Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2019-02-11 18:20
Is this a duplicate of issue24780?
msg335247 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2019-02-11 18:25
I don't think so, because the issue happens on a single line diff... although it's plausible there's a common-mode fix.
msg335248 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2019-02-11 18:32
I am not sure this is a duplicate, since the other issue was about a newline at the end of strings. This one is about the diff hints being misleading even when the strings end with a newline. Here is a sample program where changing the character at index 5 gives the reported diff.

import difflib

for i in range(7):
    print(f"Change character at {i}")
    a = list("drwxrwxr-x 2 2000  2000\n")
    b = "drwxrwxr-x 2 2000  2000\n"
    a[i] = '-'
    a = ''.join(a)
    print(''.join(difflib.ndiff([a], [b])))

Change character at 0
- -rwxrwxr-x 2 2000  2000
? ^
+ drwxrwxr-x 2 2000  2000
? ^

Change character at 1
- d-wxrwxr-x 2 2000  2000
?  ^
+ drwxrwxr-x 2 2000  2000
?  ^

Change character at 2
- dr-xrwxr-x 2 2000  2000
?   ^
+ drwxrwxr-x 2 2000  2000
?   ^

Change character at 3
- drw-rwxr-x 2 2000  2000
?    ^
+ drwxrwxr-x 2 2000  2000
?    ^

Change character at 4
- drwx-wxr-x 2 2000  2000
?     ^
+ drwxrwxr-x 2 2000  2000
?     ^

Change character at 5
- drwxr-xr-x 2 2000  2000
?        ---
+ drwxrwxr-x 2 2000  2000
?  +++

Change character at 6
- drwxrw-r-x 2 2000  2000
?       ^
+ drwxrwxr-x 2 2000  2000
?       ^
msg335252 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-02-11 18:46
difflib generally synchs on the longest contiguous matching subsequence that doesn't contain a "junk" element.  By default, `ndiff()`'s optional `charjunk` argument considers blanks and tabs to be junk characters.

In the strings:

"drwxrwxr-x 2 2000  2000\n"
"drwxr-xr-x 2 2000  2000\n"

the longest matching substring not containing whitespace is "rwxr-x", of length 6, starting at index 4 in the first string and at index 1 in the second.  So it's aligning the strings like so:

"drwxrwxr-x 2 2000  2000\n"
   "drwxr-xr-x 2 2000  2000\n"
     123456

That's why it wants to delete the 1:4 slice in the first string and insert "r-x" after the longest matching substring.

The default is aimed at improving results for human-readable text, like prose and Python code, where stuff between whitespace is often read "as a whole" (words, keywords, identifiers, ...).

For cases like this one, where character-by-character differences are important, it's often better to pass `charjunk=None`.  Then the longest matching substring is "xr-x 2 2000  2000" at the tail end of both strings, and you get the output you're expecting.
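
The two alignments described above can be checked directly with `difflib.SequenceMatcher`. This is only a minimal sketch: it assumes the documented default of `charjunk=difflib.IS_CHARACTER_JUNK` for `ndiff()`, and the variable names are illustrative.

```python
import difflib

a = "drwxrwxr-x 2 2000  2000\n"
b = "drwxr-xr-x 2 2000  2000\n"

# Default ndiff() behaviour: blanks and tabs are junk, so the matcher
# first synchs on the longest block that contains no junk.
with_junk = difflib.SequenceMatcher(difflib.IS_CHARACTER_JUNK, a, b)
junk_match = with_junk.find_longest_match(0, len(a), 0, len(b))
print(junk_match, repr(a[junk_match.a:junk_match.a + junk_match.size]))

# charjunk=None: no character is junk, so the long common tail wins.
no_junk = difflib.SequenceMatcher(None, a, b)
tail_match = no_junk.find_longest_match(0, len(a), 0, len(b))
print(tail_match, repr(a[tail_match.a:tail_match.a + tail_match.size]))

# The same switch applied to ndiff() itself.
charwise_diff = list(difflib.ndiff([a], [b], charjunk=None))
print("".join(charwise_diff), end="")
```

With the default junk setting the reported block is "rwxr-x", at index 4 of the first string and index 1 of the second. With `None` it is the common tail starting at index 6 in both strings (the match also covers the trailing newline), and the `?` hints then point at the single differing character.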
msg335257 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2019-02-11 19:04
Thanks for the explanation. Passing charjunk=None in the multiline string comparison helper does seem to give the desired diff. I am not sure how useful it would be to pass it in the sequence and dict comparisons that also use ndiff. I can open a PR if it's okay to use the strings from this report as a test case. No tests in the existing unittest test suite fail, so this seems like a safe change to me.


# With patch charjunk=None

./python.exe ../backups/bpo35955_1.py
F
======================================================================
FAIL: test_foo (__main__.FooTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "../backups/bpo35955_1.py", line 6, in test_foo
    self.assertEqual("drwxrwxr-x 2 2000  2000\n", "drwxr-xr-x 2 2000  2000\n")
AssertionError: 'drwxrwxr-x 2 2000  2000\n' != 'drwxr-xr-x 2 2000  2000\n'
- drwxrwxr-x 2 2000  2000
?      ^
+ drwxr-xr-x 2 2000  2000
?      ^


----------------------------------------------------------------------
Ran 1 test in 0.003s

FAILED (failures=1)

# Without patch

python3.7 ../backups/bpo35955_1.py
F
======================================================================
FAIL: test_foo (__main__.FooTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "../backups/bpo35955_1.py", line 6, in test_foo
    self.assertEqual("drwxrwxr-x 2 2000  2000\n", "drwxr-xr-x 2 2000  2000\n")
AssertionError: 'drwxrwxr-x 2 2000  2000\n' != 'drwxr-xr-x 2 2000  2000\n'
- drwxrwxr-x 2 2000  2000
?  ---
+ drwxr-xr-x 2 2000  2000
?        +++


----------------------------------------------------------------------
Ran 1 test in 0.002s

FAILED (failures=1)
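
The patch itself is not included in this thread. The same idea can be sketched as a stand-alone helper for projects that want character-level hints in their own tests. Everything in it is an assumption made for illustration: `CharwiseDiffTestCase` and `assertCharwiseEqual` are hypothetical names, not part of unittest and not the proposed change.

```python
import difflib
import unittest


class CharwiseDiffTestCase(unittest.TestCase):
    """Hypothetical base class; not part of unittest."""

    def assertCharwiseEqual(self, first, second):
        """Fail with an ndiff() report that treats no character as junk."""
        if first == second:
            return
        diff = difflib.ndiff(
            first.splitlines(keepends=True),
            second.splitlines(keepends=True),
            charjunk=None,  # compare character by character
        )
        raise self.failureException("strings differ:\n" + "".join(diff))


# Exercise the helper outside a test run and capture its report.
case = CharwiseDiffTestCase()
try:
    case.assertCharwiseEqual(
        "drwxrwxr-x 2 2000  2000\n", "drwxr-xr-x 2 2000  2000\n"
    )
except AssertionError as exc:
    report = str(exc)
    print(report)
```

The helper only changes how the failure is reported, so the pass/fail result is the same as with assertEqual.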
msg335268 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2019-02-11 20:05
Nice insight, Tim.
msg335269 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-02-11 20:07
It's probably OK, but there's no "pure win" to be had here.  There's generally more than one way to convert one string to another, and what "looks right" to humans depends a whole lot on context.

For example, consider these strings:

"private Thread currentThread;"
"private volatile Thread currentThread;"

"It's obvious" someone inserted "volatile" into the first string, and that's what ndiff's default says:

- private Thread currentThread;
+ private volatile Thread currentThread;
?         +++++++++

However, pass `charjunk=None` instead, and ndiff claims someone inserted "e volatil" after the "t" in "private":

- private Thread currentThread;
+ private volatile Thread currentThread;
?       +++++++++

Which is also a correct way, but - to human eyes - an insane way ;-)
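
Both readings of the "volatile" example can be reproduced side by side. This is a minimal sketch, again assuming the documented default `charjunk=difflib.IS_CHARACTER_JUNK`; `hint_line` is a hypothetical helper written for this illustration.

```python
import difflib

old = ["private Thread currentThread;\n"]
new = ["private volatile Thread currentThread;\n"]


def hint_line(charjunk):
    """Return the '?' hint line that ndiff() emits for the new string."""
    diff = difflib.ndiff(old, new, charjunk=charjunk)
    return [line for line in diff if line.startswith("? ")][-1]


# Default: blanks are junk, so the insertion lines up with whole words.
default_hint = hint_line(difflib.IS_CHARACTER_JUNK)
# charjunk=None: the longest raw match wins, splitting "private".
charwise_hint = hint_line(None)

print(new[0], end="")
print(default_hint[2:], end="")   # strip the "? " prefix to line up with the text
print(charwise_hint[2:], end="")
```

Neither setting is wrong. They pick different but equally valid edit scripts, which is why the choice depends on the kind of text being compared.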
History
Date                 User            Action  Args
2022-04-11 14:59:11  admin           set     github: 80136
2019-02-11 20:07:41  tim.peters      set     messages: + msg335269
2019-02-11 20:05:52  jaraco          set     messages: + msg335268
2019-02-11 19:04:34  xtreak          set     messages: + msg335257
2019-02-11 18:46:15  tim.peters      set     messages: + msg335252
2019-02-11 18:32:16  xtreak          set     messages: + msg335248
2019-02-11 18:25:41  jaraco          set     messages: + msg335247
2019-02-11 18:20:56  chris.jerdonek  set     nosy: + chris.jerdonek
                                             messages: + msg335246
2019-02-11 16:12:19  xtreak          set     versions: + Python 2.7, Python 3.7, Python 3.8
                                             nosy: + tim.peters
                                             messages: + msg335235
                                             type: behavior
2019-02-11 15:57:44  jaraco          set     status: closed -> open
                                             title: unittest assertEqual reports incorrect location of mismatch -> difflib reports incorrect location of mismatch
                                             messages: + msg335231
                                             resolution: third party ->
                                             stage: resolved ->
2019-02-10 14:58:26  xtreak          set     nosy: + xtreak
                                             messages: + msg335158
2019-02-10 14:24:56  jaraco          set     messages: + msg335156
2019-02-10 14:14:19  jaraco          set     status: open -> closed
                                             resolution: third party
                                             stage: resolved
2019-02-10 14:14:00  jaraco          set     messages: + msg335155
2019-02-10 14:11:58  jaraco          create