classification
Title: unittest assertSequenceEqual can lead to Difflib.compare() crashing on mostly different sequences
Type: crash Stage:
Components: Library (Lib) Versions: Python 3.7, Python 3.6, Python 3.5, Python 3.4, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: gregory.p.smith, jftuga, levkivskyi, nnja, r.david.murray, terry.reedy, tim.peters
Priority: normal Keywords:

Created on 2014-04-16 15:54 by nnja, last changed 2017-03-18 13:23 by levkivskyi.

Files
File name Uploaded Description Edit
diff_bug34.py nnja, 2014-04-16 15:54
Messages (6)
msg216479 - (view) Author: Nina Zakharenko (nnja) Date: 2014-04-16 15:54
When difflib.compare() is used on two moderately large sequences with little or no common elements, a RuntimeError: maximum recursion depth exceeded occurs. 

This error became apparent when testing another bug (see: issue 19217) in the AssertEquals() method of the unit test library.

A sample program to reproduce this issue in 3.4 is attached. To repo in 2.7 remove the list() wrapper from the range call.
msg216809 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-04-18 21:38
An obvious fix for the recursion limit error is to convert the .compare recursion to iteration with a stack (which I could try to do), but I don't know if the deep recursion is expected in this case, or is a bug. Tim?
msg216816 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2014-04-18 23:52
It appears to devolve into linear recursion in this case, one per each item in one of the sequences being searched for a match, so even using a stack seems wrong as it'd still be linear (though it would prevent the recursion depth problem).

The mutual _fancy_replace + _fancy_helper linear recursion comes from http://hg.python.org/cpython/file/604b74f9a07d/Lib/difflib.py#l1021
msg216830 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-04-19 01:35
Comparison can certainly trigger a recursion error if the sequences contain no (or few) matching non-junk elements, but contain many "almost matching" elements.  If the sequences have lengths M and N, recursion can go as deep as 2*min(M, N) then.

Now in the test case, we have two lists of integers.  Difflib has no idea what "almost match" might mean for integers.  But difflib isn't passed two lists of integers.  Instead unittest appears to be converting the input lists to giant strings, then splitting the giant strings on whitespace (or just linefeeds?), and then feeding the resulting lists of substrings to difflib.  That doesn't make much sense to me, but so it goes.

There are no matches in the two lists of strings, so difflib starts looking for "close matches", and there are a lot of these.

At first it decides "[1," and "[100," aren't close enough, but " 10," and " 101," are close enough.  That's used as a synch point, and then there's recursion to match the sublists before and after the synch point.  Then " 12," and " 102," are close enough, so that pair is used as the next synch point, and another layer of 2-sided recursion.  Etc.

Whether someone wants to rip the recursion out of _fancy_replace and _fancy_helper is up to them.  I wouldn't bother, if this unittest-created problem is the only reported instance.  Comparing strings seems a poor idea from the start (there's no guarantee in general, e.g., that A != B implies str(A) != str(B) or repr(A) != repr(B)), and difflib isn't good in any case at comparing sequences with few matching elements (e.g., remove the recursion and it will still take time at best cubic in the common length of the sequences - would it really help to change "a failing unittest bombs with RecursionError" to "a failing unittest seems to take forever"?).

I'd suggest instead that unittest, say, locate the first pair of non-equal elements itself, and display that along with a few elements of context on either side.  Or something ;-)  Something worst-case linear-time, and using != directly on sequence elements (not on strings derived from the sequence elements).
msg216835 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2014-04-19 02:04
that seems reasonable.  unittest's assertSequenceEqual is using this to attempt to display a useful error message as to what the delta was; it should try harder to avoid difflib corner cases.

At the very least, unittest should recover from a difflib failure and report a test failure without the possibly nicer message.
msg248905 - (view) Author: John Taylor (jftuga) * Date: 2015-08-20 17:44
I am seeing something similar in difflib.HtmlDiff.make_file() under Python 3.4.3 (windows 7).  Do I need to file a separate bug report?


File "H:\test\test.py", line 522, in print_differ
    diff = html.make_file(file1_data,file2_data,"dir 1","dir 2",True)
  File "C:\Python34\lib\difflib.py", line 1713, in make_file
    context=context,numlines=numlines))
  File "C:\Python34\lib\difflib.py", line 1962, in make_table
    fromlist,tolist,flaglist = self._collect_lines(diffs)
  File "C:\Python34\lib\difflib.py", line 1830, in _collect_lines
    for fromdata,todata,flag in diffs:
  File "C:\Python34\lib\difflib.py", line 1806, in _line_wrapper
    self._split_line(fromlist,fromline,fromtext)
  File "C:\Python34\lib\difflib.py", line 1791, in _split_line
    self._split_line(data_list,'>',line2)
  
  (repeated many times)

  File "C:\Python34\lib\difflib.py", line 1791, in _split_line
    self._split_line(data_list,'>',line2)
  File "C:\Python34\lib\difflib.py", line 1755, in _split_line
    if (size <= max) or ((size -(text.count('\0')*3)) <= max):
RuntimeError: maximum recursion depth exceeded in comparison
History
Date User Action Args
2017-03-18 13:23:59levkivskyisetversions: + Python 3.5, Python 3.6, Python 3.7
2015-08-20 17:44:35jftugasetnosy: + jftuga
messages: + msg248905
2015-06-29 12:10:48levkivskyisetnosy: + levkivskyi
2014-04-19 02:04:23gregory.p.smithsetmessages: + msg216835
title: Difflib.compare() crashes on mostly different sequences -> unittest assertSequenceEqual can lead to Difflib.compare() crashing on mostly different sequences
2014-04-19 01:35:24tim.peterssetmessages: + msg216830
2014-04-18 23:52:20gregory.p.smithsetmessages: + msg216816
2014-04-18 21:38:51terry.reedysetnosy: + terry.reedy, tim.peters

messages: + msg216809
title: Difflib.compare() crashes when sequences contain little or no common elements -> Difflib.compare() crashes on mostly different sequences
2014-04-16 15:55:44nnjasettitle: Difflib -> Difflib.compare() crashes when sequences contain little or no common elements
2014-04-16 15:54:46nnjacreate