Message 302897 - Python tracker



Author	tim.peters
Recipients	Mahmoud Al-Qudsi, rhettinger, tim.peters
Date	2017-09-25.00:10:45
Message-id	<1506298247.65.0.740561651282.issue31561@psf.upfronthosting.co.za>

Content

The text/binary distinction you have in mind doesn't particularly apply to difflib:  it compares sequences of hashable objects.  "Text files" are typically converted by front ends to lists of strings, but, e.g., the engine is just as happy comparing tuples of floats.
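
As a small illustration of the point that the engine only needs hashable elements, here is a sketch that feeds two tuples of floats to `difflib.SequenceMatcher`. The sample values are made up for the example.

```python
from difflib import SequenceMatcher

# Two sequences of hashable objects that have nothing to do with text.
a = (1.0, 2.5, 3.0, 4.25)
b = (1.0, 3.0, 4.25, 5.5)

# The first argument is the optional "junk" predicate; None disables it.
matcher = SequenceMatcher(None, a, b)
opcodes = matcher.get_opcodes()

for tag, i1, i2, j1, j2 in opcodes:
    # Each opcode describes how to turn a[i1:i2] into b[j1:j2].
    print(tag, a[i1:i2], b[j1:j2])
```

The opcodes describe deleting `2.5` and inserting `5.5`, exactly as they would for two lists of strings.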

File comparison interfaces typically do this at _two_ levels:  first viewing files as lists of strings (one string per file line), and then, when two blocks of mismatching lines are encountered, viewing the lines as sequences of characters.  The only role "line endings" play in any of this is in how the _input_ to the difference engine is created:  all decisions about how a file is broken into strings are made before the difference engine is consulted.  This preprocessing can choose to normalize line endings, leave them exactly as-is (typical), or remove them entirely from the strings it presents to the difference engine - or anything else it feels like doing.  The engine itself has no concept of "line termination sequences" - if there happen to be \r\n, \n, \r, or \0 substrings in strings passed to it, they're treated exactly the same as any other characters.
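
A minimal sketch of the preprocessing choices described above, using only `str.splitlines` on a made-up input string. The variable names are hypothetical, and a real front end may split files differently. The last two lines check that the engine scores a stray `\r` no differently from any other extra character.

```python
from difflib import SequenceMatcher

raw = "alpha\r\nbeta\r\n"  # made-up file contents with \r\n terminators

# Choice 1: leave line endings exactly as-is.
as_is = raw.splitlines(keepends=True)

# Choice 2: remove line endings entirely.
stripped = raw.splitlines()

# Choice 3: normalize every line ending to \n.
normalized = [line + "\n" for line in stripped]

# To the engine, "\r" is just one more character: swapping it for "X"
# gives the same character-level similarity score.
with_cr = SequenceMatcher(None, "alpha\r\n", "alpha\n").ratio()
with_x = SequenceMatcher(None, "alphaX\n", "alpha\n").ratio()
```

Whichever list the front end builds is all the engine ever sees; it has no way to tell which choice was made.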

If the input processing creates lists of lines A and B for two files, where the files have different line terminators which are left in the strings, then no exact match whatsoever is possible between any line of A and a line in B.  You suggest just skipping over both in that case, but the main text-file-comparison "front end" in difflib works hard to try to do better than that.  That's "a feature", albeit a potentially expensive one.  Viewing the file lines as sequences of characters, it computes a "similarity score" for _every_ line in A compared to _every_ line in B.  So len(A)*len(B) such scores are computed.  The pair with the highest score (assuming it exceeds a cutoff value) is taken as being the synch point, and then it can go on to show the _intra_line differences between those two lines.
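
The all-pairs search can be sketched roughly as follows. This is not difflib's actual code: `best_pair` is a hypothetical helper, the 0.75 cutoff is an assumed value for illustration, and the real implementation adds shortcuts this sketch omits. It only shows why the work grows as len(A)*len(B).

```python
from difflib import SequenceMatcher

def best_pair(a_lines, b_lines, cutoff=0.75):
    """Hypothetical sketch: score every (a, b) pair, keep the best above cutoff."""
    best = None            # (ratio, index in a_lines, index in b_lines)
    comparisons = 0
    for i, a_line in enumerate(a_lines):
        for j, b_line in enumerate(b_lines):
            comparisons += 1
            # Character-level similarity score in [0.0, 1.0].
            ratio = SequenceMatcher(None, a_line, b_line).ratio()
            if ratio > cutoff and (best is None or ratio > best[0]):
                best = (ratio, i, j)
    return best, comparisons

# Made-up sample lines with no exact line-level match.
a_lines = ["spam and eggs\n", "toast\n"]
b_lines = ["jam\n", "spam & eggs\n", "tea\n"]

best, comparisons = best_pair(a_lines, b_lines)
print(best, comparisons)
```

Here the 2 lines of `a_lines` and 3 lines of `b_lines` cost 6 character-level comparisons to find a single synch point, which is where the expense comes from on large mismatching blocks.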

That's why, e.g., given the lists of "lines":

A = ["eggrolls", "a a a", "b bb"]
B = ["ccc", "dd d", "egg rolls"]

it can (and does) tell you that the `egg rolls` in B was plausibly obtained from the `eggrolls` in A by inserting a blank.  This is often much more helpful than just giving up, saying "well, no line in A matched any line in B, so we'll just say A was entirely replaced by B".  That would be "correct" too - and much faster - but not really helpful.
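
The example above can be tried directly with `difflib.ndiff`, one of the front ends that performs the similarity search. The exact layout of the output may vary between Python versions, so none is shown here.

```python
from difflib import ndiff

A = ["eggrolls", "a a a", "b bb"]
B = ["ccc", "dd d", "egg rolls"]

# These "lines" carry no terminators, but the "?" guide lines ndiff emits do,
# so strip trailing newlines to get uniform output.
diff_lines = [line.rstrip("\n") for line in ndiff(A, B)]
for line in diff_lines:
    print(line)
```

The output should pair `- eggrolls` with `+ egg rolls` and follow them with a `?` guide line marking the inserted blank, rather than listing all of A as deleted and all of B as added.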

Of course there's nothing special about the blank character in that.  Exactly the same applies if the line terminators differ between the files, and input processing leaves them in the strings.  difflib doesn't give up just because there are no exact line-level matches, and the same expensive "similarity score" algorithm kicks in to find the "most similar" lines despite the lack of exact matches.
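
A small sketch of the same effect with line terminators, using made-up lines: at the line level nothing matches, yet each pair is highly similar at the character level, so the similarity search still runs and still finds synch points.

```python
from difflib import SequenceMatcher, ndiff

unix_lines = ["first line\n", "second line\n"]
dos_lines = ["first line\r\n", "second line\r\n"]

# Line level: count elements covered by exact matches (none are expected).
line_matcher = SequenceMatcher(None, unix_lines, dos_lines)
exact_line_matches = sum(
    block.size for block in line_matcher.get_matching_blocks()
)

# Character level: the first pair differs only by a single "\r".
char_ratio = SequenceMatcher(None, unix_lines[0], dos_lines[0]).ratio()

# ndiff marks intraline differences with "? " guide lines.
guide_lines = [
    line for line in ndiff(unix_lines, dos_lines) if line.startswith("? ")
]
```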

Since that's a feature (albeit potentially expensive), I agree with Raymond closing this.  You can, of course, avoid the expense by ensuring your files all use the same line terminator sequence to begin with.  Which is the one obvious & thoroughly sane approach ;-)  Alternatively, talk to the `icdiff` author(s).  I noticed it opens files for reading in binary mode, guaranteeing that different line-end conventions will be visible.  It's possible they could be talked into opening text files (or adding an option to do so) using Python's "universal newline" mode, which converts all instances of \n, \r\n, and \r to \n on input.  Then lines that are identical except for line-end convention would in fact appear identical to difflib, and so skip the expensive similarity computations whenever that's so.
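
To illustrate the difference between the two ways of reading, here is a sketch that writes two temporary files with different terminators and reads them both ways. `read_lines` is a hypothetical helper, not part of `icdiff` or difflib, and the UTF-8 encoding is an assumption made for the example.

```python
import os
import tempfile
from difflib import SequenceMatcher

def read_lines(path, universal_newlines):
    """Hypothetical front-end helper: return a file's lines for diffing."""
    if universal_newlines:
        # Text mode with newline=None translates \r\n and \r to \n on input.
        with open(path, "r", encoding="utf-8", newline=None) as f:
            return f.readlines()
    # Binary mode leaves the terminators visible in each line.
    with open(path, "rb") as f:
        return f.readlines()

with tempfile.TemporaryDirectory() as tmp:
    unix_path = os.path.join(tmp, "unix.txt")
    dos_path = os.path.join(tmp, "dos.txt")
    with open(unix_path, "wb") as f:
        f.write(b"spam\neggs\n")
    with open(dos_path, "wb") as f:
        f.write(b"spam\r\neggs\r\n")

    binary_a = read_lines(unix_path, universal_newlines=False)
    binary_b = read_lines(dos_path, universal_newlines=False)
    text_a = read_lines(unix_path, universal_newlines=True)
    text_b = read_lines(dos_path, universal_newlines=True)

# Line-level similarity: no exact matches in binary mode, all lines in text mode.
binary_ratio = SequenceMatcher(None, binary_a, binary_b).ratio()
text_ratio = SequenceMatcher(None, text_a, text_b).ratio()
print(binary_ratio, text_ratio)
```

With universal newlines the two lists compare equal, so a front end would find exact line matches and never reach the pairwise similarity scoring for these lines.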
