Title: difflib: mention other "problematic" characters in documentation
Type: Stage:
Components: Documentation Versions: Python 3.10, Python 3.9, Python 3.8
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: docs@python, jugmac00, terry.reedy, tim.peters
Priority: normal Keywords:

Created on 2021-04-01 08:09 by jugmac00, last changed 2021-04-06 04:15 by tim.peters.

Pull Requests
URL Status Linked Edit
PR 25132 open jugmac00, 2021-04-01 08:09
Messages (7)
msg389961 - (view) Author: Jürgen Gmach (jugmac00) * Date: 2021-04-01 08:09
In the documentation you can currently read for the "?"-output:

"These lines can be confusing if the sequences contain tab characters."

From first-hand experience :-), I can assure you it is also very confusing for other types of whitespace characters, such as spaces and line breaks.

I'd like to add the other characters to the documentation.
msg390113 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2021-04-03 02:00
The quote is in the following section.
I do not really understand the previous line "Lines beginning with ‘?’ attempt to guide the eye to intraline differences, and were not present in either input sequence."  Can you give examples where '?' occurs, with tabs and spaces (newlines would not be within a line)?
msg390115 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2021-04-03 02:26
Lines beginning with "?" are entirely synthetic: they were not present in either input.  So that's what that part means.

I'm not clear on what else could be materially clearer without greatly bloating the text. For example,

>>> import difflib
>>> d = difflib.Differ()
>>> for L in d.compare(["abcefghijkl\n"], ["a cxefghijkl\n"]):
	print(L, end="")
- abcefghijkl
?  ^
+ a cxefghijkl
?  ^ +

The "?" lines guide the eye to the places that differ: "b" was replaced by a blank, and "x" was inserted.  The marks on the "?" lines are intended to point out exactly where changes (substitutions, insertions, deletions) occurred.
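
A minimal sketch of where those marks come from, assuming it is acceptable to look at the character-level comparison directly with `difflib.SequenceMatcher` (the variable names are illustrative only). The `replace` and `insert` operations it reports line up with the "^" and "+" marks in the output above.

```python
import difflib

old = "abcefghijkl\n"
new = "a cxefghijkl\n"

# Compare the two lines character by character.
matcher = difflib.SequenceMatcher(None, old, new)
opcodes = matcher.get_opcodes()

# Each opcode names an operation and the slices of old/new it covers.
for tag, i1, i2, j1, j2 in opcodes:
    print(f"{tag:8} old[{i1}:{i2}]={old[i1:i2]!r:14} new[{j1}:{j2}]={new[j1:j2]!r}")
```

The single-character `replace` covers "b" versus the blank, and the `insert` covers the added "x".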

If the second input had a tab instead of a blank, the "+" wouldn't _appear_ to be under the "x" at all.  It would instead "look like" a long string of blanks was between "a" and "c" in the first input, and the "+" would appear to be under one of them somewhere near the middle of the empty space.
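
The tab case can be sketched as follows. This is only an illustration: it assumes a display that expands tabs to 8-column tab stops, which is common but not universal, and the variable names are made up for the example. It compares the string index of the "+" marker with the column where "x" would appear once the tab is expanded.

```python
import difflib

old = ["abcefghijkl\n"]
new = ["a\tcxefghijkl\n"]  # a tab where the earlier example had a blank

lines = list(difflib.Differ().compare(old, new))
plus_line = next(line for line in lines if line.startswith("+ "))
guide_line = [line for line in lines if line.startswith("? ")][-1]

# Index of the "+" marker within the "?" line.
marker_col = guide_line.index("+")

# Column where "x" lands on screen, ASSUMING 8-column tab stops.
visual_col = plus_line.expandtabs(8).index("x")

print("marker at index", marker_col, "but x is displayed at column", visual_col)
```

In terms of string indexes the marker sits exactly under "x"; it is only the rendering of the tab that moves "x" to the right of it.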

Tough luck. Use tab characters (or any other kind of "goofy" whitespace) in input to visual tools, and you deserve whatever you get :-)
msg390117 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2021-04-03 06:47
After 3+ years of GitHub I did not remember that B&W diffs use lines with change position markers, and in particular that they (often? always?) start with ?s. IDLE also uses color to mark positions (for syntax errors).  The following would have been clearer to me and likely to people who have never seen such lines.

"Location marker lines beginning with ‘?’ use symbols to guide the eye to intraline differences."

Tim, you seem to still think that tabs are especially problematical. 

Jürgen, without evidence otherwise, I agree with this.  Adding other chars to the sentence would dilute the current focus on tabs.  Hence my request for examples to justify doing so.  Sorry I was not as clear as I could and should have been.
msg390119 - (view) Author: Jürgen Gmach (jugmac00) * Date: 2021-04-03 08:27
First I need to apologize for not providing more info already when I created the issue.

Initially, I did not even plan to create an issue, and thought the PR with the context of the current documentation would be sufficient information.

Thanks for taking your time anyway!

Also, thanks to Tim for explaining the meaning of the question mark in detail. When I read the documentation, I also had to pause a moment to understand the sentence. But I agree with Tim, it is hard to explain it better without getting much more verbose.

My initial reason to read (and then to update) the documentation was an output of pytest, which left me puzzled.

E           AssertionError: assert 'ROOT: No tox...ith_no_t0/p\n' == 'ROOT: No tox..._with_no_t0/p'
E             Skipping 136 identical leading characters in diff, use -v to show
E             - ith_no_t0/p
E             + ith_no_t0/p
E             ?            +

Here is the screenshot and some discussion:

Using a similar snippet as Tim, here is a minimal example:

for L in d.compare(["abcdefghijkl"], ["abcdefghijkl\n"]):
    print(L)

- abcdefghijkl
+ abcdefghijkl

?             +

Most of the time the output is pretty obvious, so I never actually noticed the question mark - except when whitespace characters are involved.

I was then told that pytest uses difflib, and I was kindly pointed to the Python documentation.

As only the tab character was listed, I thought it would be a good idea to add the other whitespace characters as well.

After Tim's explanation, I see that tabs could be especially confusing, while all other whitespace characters are on a normal level of confusing :-), especially at the end of the diff.

I certainly won't forget what I learned, but maybe my proposal helps one fellow Python user or another.
msg390272 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2021-04-06 03:19
I have an alternate replacement:  "These lines can be confusing if the sequences contain tab characters or other characters that result in the indicator symbols in these lines being mislocated."

Or leave the current sentence as is.

Explanation with the details omitted from the above:
In 3.x, strings are unicode.  Even if one uses a fixed pitch font for the ascii subset, a majority of characters will be rendered either in a different fixed pitch or with variable pitch.  And on a graphics screen that is not simulating a fixed-pitch text terminal (such as Windows console), the so-called double-wide East Asian characters are not really double wide but more like 1.6 times as wide.  The details depend on the OS, the font, and perhaps the font size.  One can explore this in the font sample box for the Font tab of the IDLE settings dialog.  The problems include chars less than 'one space', down to 0 wide.  For general unicode, ^ marking does not work.  Syntax error marking has the same problem and there is no general solution.  

Tab is an example of a character that is either displayed as a variable space or a fixed double space ('\t') or larger.  If we were to make a change, we should mention, as above, that many non-ascii chars can be just as confusing as tabs.

In your example above, the caret at least points to the right space.  It correctly indicates some difference beyond the visible end - a non-visible whitespace difference.
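
One way to check this, sketched with the same inputs as the minimal example earlier in the thread (the variable names are illustrative): look up the character at the marker's index in the "+" line and print its `repr()`.

```python
import difflib

lines = list(difflib.Differ().compare(["abcdefghijkl"], ["abcdefghijkl\n"]))
plus_line = next(line for line in lines if line.startswith("+ "))
guide_line = next(line for line in lines if line.startswith("? "))

# Index of the "+" marker within the "?" line.
marker_col = guide_line.index("+")

# repr() makes the otherwise invisible character readable.
print(repr(plus_line[marker_col]))
```

Printing lines with `repr()` is one possible way to make trailing whitespace differences visible when the markers alone are puzzling.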
msg390274 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2021-04-06 04:15
Terry, your suggested replacement statement looks like an improvement to me. Perhaps the longer explanation could be placed in a footnote.

Note that I'm old ;-) I grew up on plain old ASCII, decades & decades ago, and tabs are in fact the only "characters" I've had a problem with in doctests. But then, e.g., I never in my life used goofy things like ASCII "form feed" characters, or NUL bytes, or ... in text either.

I don't use Unicode either, except to the extent that Python forces me to when I'm sticking printable ASCII characters inside string quotes ;-)
Date User Action Args
2021-04-06 04:15:36 tim.peters set messages: + msg390274
2021-04-06 03:19:39 terry.reedy set messages: + msg390272
2021-04-03 08:27:22 jugmac00 set messages: + msg390119
2021-04-03 06:47:41 terry.reedy set messages: + msg390117
2021-04-03 02:26:57 tim.peters set messages: + msg390115
2021-04-03 02:00:40 terry.reedy set nosy: + terry.reedy; messages: + msg390113; versions: - Python 3.6, Python 3.7
2021-04-01 09:20:50 xtreak set nosy: + tim.peters
2021-04-01 08:09:11 jugmac00 create