Issue 31561: difflib pathological behavior with mixed line endings

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/75742

classification

Title:	difflib pathological behavior with mixed line endings
Type:		Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.6

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	Mahmoud Al-Qudsi, rhettinger, tim.peters
Priority:	normal	Keywords:

Created on 2017-09-23 17:01 by Mahmoud Al-Qudsi, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
file1	Mahmoud Al-Qudsi, 2017-09-23 17:01
file2	Mahmoud Al-Qudsi, 2017-09-23 17:01

Messages (6)
msg302788 - (view)	Author: Mahmoud Al-Qudsi (Mahmoud Al-Qudsi)	Date: 2017-09-23 17:01
While using the icdiff command line interface to difflib, I ran into an interesting issue where difflib took 47 seconds to compare two simple text documents (a PHP source code file that had been refactored via phptidy). On subsequent analysis, it turned out to be some sort of pathological behavior triggered by the presence of mixed line endings. Normalizing the line endings in both files to \r\n via unix2dos and then comparing (making no other changes) resulted in the diff calculation completing in under 2 seconds. I have attached the documents in question (file1 and file2) to this bug report.
msg302789 - (view)	Author: Mahmoud Al-Qudsi (Mahmoud Al-Qudsi)	Date: 2017-09-23 17:01
Attaching file2
msg302828 - (view)	Author: Tim Peters (tim.peters) *	Date: 2017-09-24 04:13
I'm not familiar with `icdiff` - it's not part of the Python distribution, right? If so, you should really talk to its author(s). If two files have different line endings, then no pair of lines between them can be equal, and the difference engine will struggle mightily to find the "best matching" pair of lines between them, over & over & over ... again. That's expected, and is expensive (roughly cubic in the number of lines). So "don't do that" is the obvious answer. If you think it's something that comes up often enough that `icdiff` should do magic to worm around it, they could - with enough pain - normalize line endings to a canonical form, and then post-process the diff engine's output to say that some lines it thought were equal actually weren't. Or you maybe you want them treated as equal despite that the line endings don't match? Nobody can guess, so at best another option would need to be added to specify which. That's messy all around, so they'll probably resist. Just like I'd mightily resist mucking up difflib to do that itself ;-) Seriously, if you're messing with files containing a mix of line endings, you're begging for surprises from all sorts of tools.
msg302881 - (view)	Author: Mahmoud Al-Qudsi (Mahmoud Al-Qudsi)	Date: 2017-09-24 19:27
@tim.peters No, `icdiff` is not part of core and probably should be omitted from the remainder of this discussion. I just checked and it's actually not a mix of line endings in each file, it's just that one file is \n and the other is \r\n You can actually just duplicate this bug by taking _any_ file and copying it, then executing `unix2dos file1; dos2unix file2` - you'll have to perfectly "correct" files2 that difflib will struggle to handle. (as a preface to what follows, I've written a binary diff and incremental backup utility, so I'm familiar with the intricacies and pitfalls when it comes to diffing. I have not looked at difflib's source code, however. Looking at the documentation for difflib, it's not clear whether or not it should be considered a naive binary diffing utility, since it does seem to have the concept of "lines".) Given that _both_ input files are "correct" without line ending errors, I think the correct optimization here would be for difflib to "realize" that two chunks are "identical" but with different line endings (aka just plain different, not asking for this to be treated as a special case) but instead of going on to search for a match to either buffer, it should assume that no better match will be found later on and simply move on to the next block/chunk. Of course, in the event where file2 has a line from file1 that is first present with a different line ending then repeated with the same line ending, difflib will not choose the correct line.. but that's probably not something worth fretting over (like you said, mixed line endings == recipe for disaster). Of course I can understand if all this is out of the scope of difflib and not an endeavor worth taking up.
msg302896 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2017-09-25 00:09
> Of course I can understand if all this is out of the scope of difflib and not an endeavor worth taking up. I agree with that sentiment. Data normalization for comparability belongs upstream from difflib (i.e. normalizing line-endings, unicode normalization, case folding, etc). Difflib's job is to compute a diff.
msg302897 - (view)	Author: Tim Peters (tim.peters) *	Date: 2017-09-25 00:10
The text/binary distinction you have in mind doesn't particularly apply to difflib: it compares sequences of hashable objects. "Text files" are typically converted by front ends to lists of strings, but, e.g., the engine is just as happy comparing tuples of floats. File comparison interfaces typically do this at _two_ levels: first, viewing files as lists of strings (one string per file line). Then, when two blocks of mismatching lines are encountered, viewing the lines as sequences of characters. The only role "line endings" play in any of this is in how the _input_ to the difference engine is created: all decisions about how a file is broken into strings are made before the difference engine is consulted. This preprocessing can choose to normalize line endings, leave them exactly as-is (typical), or remove them entirely from the strings it presents to the difference engine - or anything else it feels like doing. The engine itself has no concept of "line termination sequences" - if there happen to be \r\n, \n, \r, or \0 substrings in strings passed to it, they're treated exactly the same as any other characters. If the input processing creates lists of lines A and B for two files, where the files have different line-end terminators which are left in the strings, then no exact match whatsoever is possible between any line of A and a line in B. You suggest to just skip over both then, but the main text-file-comparison "front end" in difflib works hard to try to do better than that. That's "a feature", albeit a potentially expensive one. Viewing the file lines as sequences of characters, it computes a "similarity score" for _every_ line in A compared to _every_ line in B. So len(A)*len(B) such scores are computed. The pair with the highest score (assuming it exceeds a cutoff value) is taken as being the synch point, and then it can go on to show the _intra_line differences between those two lines. That's why, e.g., given the lists of "lines": A = ["eggrolls", "a a a", "b bb"] B = ["ccc", "dd d", "egg rolls"] it can (and does) tell you that the `egg rolls` in B was plausibly obtained from the `eggrolls` in A by inserting a blank. This is often much more helpful than just giving up, saying "well, no line in A matched any line in B, so we'll just say A was entirely replaced by B". That would be "correct" too - and much faster - but not really helpful. Of course there's nothing special about the blank character in that. Exactly the same applies if the line terminators differ between the files, and input processing leaves them in the strings. difflib doesn't give up just because there are no exact line-level matches, and the same expensive "similarity score" algorithm kicks in to find the "most similar" lines despite the lack of exact matches. Since that's a feature (albeit potentially expensive), I agree with Raymond closing this. You can, of course, avoid the expense by ensuring your files all use the same line terminator sequence to begin with. Which is the one obvious & thoroughly sane approach ;-) Alternatively, talk to the `icdiff` author(s). I noticed it opens files for reading in binary mode, guaranteeing that different line-end conventions will be visible. It's possible they could be talked into opening text files (or add an option to do so) using Python's "universal newline" mode, which converts all instances of \n, \r\n, and \r to \n on input. Then lines that are identical except for line-end convention would in fact appear identical to difflib, and so skip the expensive similarity computations whenever that's so.

History
Date	User	Action	Args
2022-04-11 14:58:52	admin	set	github: 75742
2017-09-25 00:10:47	tim.peters	set	messages: + msg302897
2017-09-25 00:09:02	rhettinger	set	status: open -> closed nosy: + rhettinger messages: + msg302896 resolution: not a bug stage: resolved
2017-09-24 19:27:19	Mahmoud Al-Qudsi	set	messages: + msg302881
2017-09-24 04:13:24	tim.peters	set	messages: + msg302828
2017-09-23 21:23:36	rhettinger	set	nosy: + tim.peters
2017-09-23 17:01:33	Mahmoud Al-Qudsi	set	files: + file2 messages: + msg302789
2017-09-23 17:01:19	Mahmoud Al-Qudsi	create