Message 216830 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	tim.peters
Recipients	gregory.p.smith, nnja, r.david.murray, terry.reedy, tim.peters
Date	2014-04-19.01:35:22
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1397871324.49.0.975166389939.issue21253@psf.upfronthosting.co.za>
In-reply-to

Content
Comparison can certainly trigger a recursion error if the sequences contain no (or few) matching non-junk elements, but contain many "almost matching" elements. If the sequences have lengths M and N, recursion can go as deep as 2*min(M, N) then. Now in the test case, we have two lists of integers. Difflib has no idea what "almost match" might mean for integers. But difflib isn't passed two lists of integers. Instead unittest appears to be converting the input lists to giant strings, then splitting the giant strings on whitespace (or just linefeeds?), and then feeding the resulting lists of substrings to difflib. That doesn't make much sense to me, but so it goes. There are no matches in the two lists of strings, so difflib starts looking for "close matches", and there are a lot of these. At first it decides "[1," and "[100," aren't close enough, but " 10," and " 101," are close enough. That's used as a synch point, and then there's recursion to match the sublists before and after the synch point. Then " 12," and " 102," are close enough, so that pair is used as the next synch point, and another layer of 2-sided recursion. Etc. Whether someone wants to rip the recursion out of _fancy_replace and _fancy_helper is up to them. I wouldn't bother, if this unittest-created problem is the only reported instance. Comparing strings seems a poor idea from the start (there's no guarantee in general, e.g., that A != B implies str(A) != str(B) or repr(A) != repr(B)), and difflib isn't good in any case at comparing sequences with few matching elements (e.g., remove the recursion and it will still take time at best cubic in the common length of the sequences - would it really help to change "a failing unittest bombs with RecursionError" to "a failing unittest seems to take forever"?). I'd suggest instead that unittest, say, locate the first pair of non-equal elements itself, and display that along with a few elements of context on either side. Or something ;-) Something worst-case linear-time, and using != directly on sequence elements (not on strings derived from the sequence elements).

Comparison can certainly trigger a recursion error if the sequences contain no (or few) matching non-junk elements, but contain many "almost matching" elements. If the sequences have lengths M and N, recursion can go as deep as 2*min(M, N) then.

Now in the test case, we have two lists of integers. Difflib has no idea what "almost match" might mean for integers. But difflib isn't passed two lists of integers. Instead unittest appears to be converting the input lists to giant strings, then splitting the giant strings on whitespace (or just linefeeds?), and then feeding the resulting lists of substrings to difflib. That doesn't make much sense to me, but so it goes.

There are no matches in the two lists of strings, so difflib starts looking for "close matches", and there are a lot of these.

At first it decides "[1," and "[100," aren't close enough, but " 10," and " 101," are close enough. That's used as a synch point, and then there's recursion to match the sublists before and after the synch point. Then " 12," and " 102," are close enough, so that pair is used as the next synch point, and another layer of 2-sided recursion. Etc.

Whether someone wants to rip the recursion out of _fancy_replace and _fancy_helper is up to them. I wouldn't bother, if this unittest-created problem is the only reported instance. Comparing strings seems a poor idea from the start (there's no guarantee in general, e.g., that A != B implies str(A) != str(B) or repr(A) != repr(B)), and difflib isn't good in any case at comparing sequences with few matching elements (e.g., remove the recursion and it will still take time at best cubic in the common length of the sequences - would it really help to change "a failing unittest bombs with RecursionError" to "a failing unittest seems to take forever"?).

I'd suggest instead that unittest, say, locate the first pair of non-equal elements itself, and display that along with a few elements of context on either side. Or something ;-) Something worst-case linear-time, and using != directly on sequence elements (not on strings derived from the sequence elements).

History
Date	User	Action	Args
2014-04-19 01:35:24	tim.peters	set	recipients: + tim.peters, terry.reedy, gregory.p.smith, r.david.murray, nnja
2014-04-19 01:35:24	tim.peters	set	messageid: <1397871324.49.0.975166389939.issue21253@psf.upfronthosting.co.za>
2014-04-19 01:35:24	tim.peters	link	issue21253 messages
2014-04-19 01:35:22	tim.peters	create