Issue 43689: difflib: mention other "problematic" characters in documentation

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/87855

classification

Title:	difflib: mention other "problematic" characters in documentation
Type:		Stage:
Components:	Documentation	Versions:	Python 3.10, Python 3.9, Python 3.8

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	docs@python, jugmac00, terry.reedy, tim.peters
Priority:	normal	Keywords:

Created on 2021-04-01 08:09 by jugmac00, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL	Status	Linked	Edit
PR 25132	open	jugmac00, 2021-04-01 08:09

Messages (7)
msg389961 - (view)	Author: Jürgen Gmach (jugmac00) *	Date: 2021-04-01 08:09
In the documentation, the note on the "?" output currently reads: "These lines can be confusing if the sequences contain tab characters." From firsthand experience :-), I can assure you they are also very confusing with other types of whitespace characters, such as spaces and line breaks. I'd like to add the other characters to the documentation.
msg390113 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2021-04-03 02:00
The quote is from the following section: https://docs.python.org/3/library/difflib.html#difflib.Differ I do not really understand the previous line: "Lines beginning with ‘?’ attempt to guide the eye to intraline differences, and were not present in either input sequence." Can you give examples where '?' occurs, with tabs and spaces? (Newlines would not be within a line.)
msg390115 - (view)	Author: Tim Peters (tim.peters) *	Date: 2021-04-03 02:26
Lines beginning with "?" are entirely synthetic: they were not present in either input. So that's what that part means.

I'm not clear on what else could be materially clearer without greatly bloating the text. For example,

>>> d = difflib.Differ()
>>> for L in d.compare(["abcefghijkl\n"], ["a cxefghijkl\n"]): print(L, end="")
- abcefghijkl
?  ^
+ a cxefghijkl
?  ^ +

The "?" lines guide the eye to the places that differ: "b" was replaced by a blank, and "x" was inserted. The marks on the "?" lines are intended to point out exactly where changes (substitutions, insertions, deletions) occurred.

If the second input had a tab instead of a blank, the "+" wouldn't _appear_ to be under the "x" at all. It would instead "look like" a long string of blanks was between "a" and "c" in the first input, and the "+" would appear to be under one of them somewhere near the middle of the empty space.

Tough luck. Use tab characters (or any other kind of "goofy" whitespace) in input to visual tools, and you deserve whatever you get :-)
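Editor's illustration of the tab case described in the message above. This is a minimal sketch, not part of the original thread: the helper name `differ_output` is hypothetical, and the 8-column tab stops are an assumption about how the output is displayed. It compares the character index of the "+" marker with the column where "x" ends up once the tab is expanded.

```python
import difflib


def differ_output(old_line, new_line):
    """Return the Differ delta for two one-line inputs as a list of strings."""
    return list(difflib.Differ().compare([old_line], [new_line]))


# Same change as in the message above, but with a tab where the blank was.
tab_delta = differ_output("abcefghijkl\n", "a\tcxefghijkl\n")
plus_line, marker_line = tab_delta[2], tab_delta[3]

# The marker is placed by character index, so it lines up in the raw strings...
x_index = plus_line.index("x")
marker_index = marker_line.index("+")

# ...but once the tab is expanded for display (8-column stops assumed here),
# "x" lands further right than the "+" that is meant to point at it.
x_display_column = plus_line.expandtabs(8).index("x")

for line in tab_delta:
    print(repr(line))
print(x_index, marker_index, x_display_column)
```

Printing with `repr()` makes the tab visible as `\t`, which is one way to read such a delta without relying on how a terminal renders it.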
msg390117 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2021-04-03 06:47
After 3+ years of GitHub I did not remember that B&W diffs use lines with change position markers, and in particular that they (often? always?) start with ?s. IDLE also uses color to mark positions (for syntax errors). The following would have been clearer to me and likely to people who have never seen such lines: "Location marker lines beginning with ‘?’ use symbols to guide the eye to intraline differences."

Tim, you seem to still think that tabs are especially problematic. Jürgen, without evidence otherwise, I agree with this. Adding other chars to the sentence would dilute the current focus on tabs. Hence my request for examples to justify doing so. Sorry I was not as clear as I could and should have been.
msg390119 - (view)	Author: Jürgen Gmach (jugmac00) *	Date: 2021-04-03 08:27
First I need to apologize for not providing more info when I created the issue. Initially, I did not even plan to create an issue, and thought the PR with the context of the current documentation would be sufficient information. Thanks for taking your time anyway!

Also, thanks to Tim for explaining the meaning of the question mark in detail. When I read the documentation, I also had to pause a moment to understand the sentence. But I agree with Tim, it is hard to explain it better without getting much more verbose.

My initial reason to read (and then to update) the documentation was an output of pytest, which left me puzzled.

E AssertionError: assert 'ROOT: No tox...ith_no_t0/p\n' == 'ROOT: No tox..._with_no_t0/p'
E Skipping 136 identical leading characters in diff, use -v to show
E - ith_no_t0/p
E + ith_no_t0/p
E ?            +

Here is the screenshot and some discussion: https://twitter.com/jugmac00/status/1377317886419738624

Using a similar snippet as Tim, here is a minimal example:

for L in d.compare(["abcdefghijkl"], ["abcdefghijkl\n"]):
    print(L)

- abcdefghijkl
+ abcdefghijkl

?             +

The output is usually pretty obvious, so I had never actually noticed the question mark, except when whitespace characters are involved.

I was then told that pytest uses difflib, and I was kindly pointed to the Python documentation. As only the tab character was listed, I thought it would be a good idea to add the other whitespace characters as well.

After Tim's explanation, I see that tabs could be especially confusing, while all whitespace characters are on a normal level of confusing :-), especially at the end of the diff.

I certainly won't forget what I learned, but maybe my proposal helps one fellow Python user or another.
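Editor's illustration of the trailing-newline case from the message above. This is a sketch added for clarity, not part of the original thread; the variable names are hypothetical. It collects the delta and prints each line with `repr()`, so the newline that the "+" marker points at becomes visible as `\n`.

```python
import difflib

# The two inputs differ only by a trailing newline.
without_newline = ["abcdefghijkl"]
with_newline = ["abcdefghijkl\n"]

delta = list(difflib.Differ().compare(without_newline, with_newline))

# repr() shows the otherwise invisible newline and the marker's exact offset.
for line in delta:
    print(repr(line))

# The "+" marker sits at the same character index as the added newline.
marker_index = delta[2].index("+")
newline_index = delta[1].index("\n")
```

Here the marker is located correctly; the confusion comes only from the marked character being invisible when printed normally.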
msg390272 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2021-04-06 03:19
I have an alternate replacement: "These lines can be confusing if the sequences contain tab characters or other characters that result in the indicator symbols in these lines being mislocated." Or leave the current sentence as is.

Explanation with the details omitted from the above: In 3.x, strings are Unicode. Even if one uses a fixed-pitch font for the ASCII subset, a majority of characters will be rendered either in a different fixed pitch or with variable pitch. And on a graphics screen that is not simulating a fixed-pitch text terminal (such as Windows console), the so-called double-wide East Asian characters are not really double wide but more like 1.6 times as wide. The details depend on the OS, the font, and perhaps the font size. One can explore this in the font sample box for the Font tab of the IDLE settings dialog. The problems include chars narrower than 'one space', down to 0 wide.

For general Unicode, ^ marking does not work. Syntax error marking has the same problem and there is no general solution.

Tab is an example of a character that is either displayed as a variable space or a fixed double space ('\t') or larger. If we were to make a change, we should mention, as above, that many non-ASCII chars are just as confusing as tabs.

In your example above, the caret at least points to the right space. It correctly indicates some difference beyond the visible end: a non-visible whitespace difference.
msg390274 - (view)	Author: Tim Peters (tim.peters) *	Date: 2021-04-06 04:15
Terry, your suggested replacement statement looks like an improvement to me. Perhaps the longer explanation could be placed in a footnote. Note that I'm old ;-) I grew up on plain old ASCII, decades & decades ago, and tabs are in fact the only "characters" I've had a problem with in doctests. But then, e.g., I never in my life used goofy things like ASCII "form feed" characters, or NUL bytes, or ... in text either. I don't use Unicode either, except to the extent that Python forces me to when I'm sticking printable ASCII characters inside string quotes ;-)

History
Date	User	Action	Args
2022-04-11 14:59:43	admin	set	github: 87855
2021-04-06 04:15:36	tim.peters	set	messages: + msg390274
2021-04-06 03:19:39	terry.reedy	set	messages: + msg390272
2021-04-03 08:27:22	jugmac00	set	messages: + msg390119
2021-04-03 06:47:41	terry.reedy	set	messages: + msg390117
2021-04-03 02:26:57	tim.peters	set	messages: + msg390115
2021-04-03 02:00:40	terry.reedy	set	nosy: + terry.reedy messages: + msg390113 versions: - Python 3.6, Python 3.7
2021-04-01 09:20:50	xtreak	set	nosy: + tim.peters
2021-04-01 08:09:11	jugmac00	create