classification
Title: filecmp.cmp's "shallow" option
Type: behavior Stage: test needed
Components: Extension Modules Versions: Python 3.3, Python 3.2, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: fdrake Nosy List: fdrake, georg.brandl, goatsofmendez, pvrijlandt, rhettinger, sandro.tosi
Priority: normal Keywords:

Created on 2005-07-08 09:01 by goatsofmendez, last changed 2011-10-31 10:04 by sandro.tosi.

Messages (7)
msg25751 - (view) Author: Mendez (goatsofmendez) Date: 2005-07-08 09:01
The filecmp.cmp function has a shallow option (set as
default) to only compare files based on stats rather
than doing a bit by bit comparison of the file itself.
The relevant bit of the code follows.

    s1 = _sig(os.stat(f1))
    s2 = _sig(os.stat(f2))
    if s1[0] != stat.S_IFREG or s2[0] != stat.S_IFREG:
        return False
    if shallow and s1 == s2:
        return True
    if s1[1] != s2[1]:
        return False

    result = _cache.get((f1, f2))
    if result and (s1, s2) == result[:2]:
        return result[2]
    outcome = _do_cmp(f1, f2)
    _cache[f1, f2] = s1, s2, outcome
    return outcome

There's a check to see if the shallow mode is enabled
and if that's the case and the stats match it returns
true but the test for returning false is for only one
of the stats attributes meaning that it's possible for
a file not to match either test and the function to
carry on to the full file check lower down.

The check above to see if the stats match with
stat.S_IFREG also looks wrong to me, but it could just
be I don't understand what he's trying to do :)

This code is the same in both 2.3 and 2.4
msg25752 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2005-08-26 08:19
Logged In: YES 
user_id=1188172

Hm. Looks like if the size is identical, but the mtime is
not, the file will be read even in shallow mode.

The filecmp docs say "Unless shallow is given and is false,
files with identical os.stat() signatures are taken to be
equal."
The filecmp.cmp docstring says "shallow: Just check stat
signature (do not read the files)"

Two questions arise:
- Should the file contents be compared in shallow mode if
the mtimes do not match?
- Should the mtimes matter in non-shallow mode?
msg25753 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2005-08-26 08:33
Logged In: YES 
user_id=80475

Fred, do you have thoughts on this patch?
msg25754 - (view) Author: Mendez (goatsofmendez) Date: 2005-08-26 11:38
Logged In: YES 
user_id=1309441

On my own system I've modified the testing code as follows

    if s1 != s2:
        return False

    if shallow and s1 == s2:
        return True

Which works as I expected it to work

If attributes aren't identical - Always fails
If the attributes are identical and it's in shallow mode - returns 
true
If the attributes are identical and it's not in shallow mode - 
goes on to check if the files are byte identical.

Whether there should be additional modes for finding byte 
identical files with different names, attributes etc. is another 
matter.
msg25755 - (view) Author: Patrick Vrijlandt (pvrijlandt) Date: 2005-08-28 20:24
Logged In: YES 
user_id=1307917

The cache is buggy too; it can be fooled by changing the cwd 
between calls. Also, if it caches (a, b) it does not cache (b, 
a). Is this caching feature really useful? Maybe it's up to the 
caller to record what comparisons have been made. If a 
caching function is to be provided, it should recognize the (b, 
a) case and maybe also infer from (a, b) and (b, c) about (a, 
c). (My programs remember an md5-signature)
I propose to eliminate the caching feature.

Also I propose to publish cmp_shallow and cmp_bytes which 
I think will make the unit easier to understand and verify.
cmp_bytes is of course _do_cmp, with a check added for 
length equality.
Then, the docs should be updated to whatever 
implementation for cmp is chosen, because non-shallow 
comparison is ill-specified.

cmp should return a bool.

what should cmp do with funny input like a dir? It now returns 
false. Would an exception be better?

BTW: is it really useful in _do_cmp to have a local variable 
bufsize? is 8k still optimal?
msg25756 - (view) Author: Patrick Vrijlandt (pvrijlandt) Date: 2005-08-28 22:16
Logged In: YES 
user_id=1307917

"cmp should return a bool" refers to the modules docstring. 
On other places, it does already.

The default value for "shallow" should be a bool too.
msg146682 - (view) Author: Sandro Tosi (sandro.tosi) * (Python committer) Date: 2011-10-31 10:04
Hello,
we recently received a this message on docs@ : http://mail.python.org/pipermail/docs/2011-October/006121.html that's actually related to this issue: how can we move it forward?
History
Date User Action Args
2011-10-31 10:04:54sandro.tosisetnosy: + sandro.tosi

messages: + msg146682
versions: + Python 3.3, - Python 3.1
2010-08-21 18:27:35BreamoreBoysetversions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6
2009-02-16 00:27:32ajaksu2setstage: test needed
type: behavior
versions: + Python 2.6, - Python 2.4
2005-07-08 09:01:20goatsofmendezcreate