classification
Title: filecmp.cmp documentation does not match actual code
Type: behavior Stage: patch review
Components: Library (Lib) Versions:
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: FreeSandwiches, chanke
Priority: normal Keywords: patch

Created on 2020-07-21 07:29 by chanke, last changed 2020-12-28 07:45 by chanke.

Pull Requests
URL Status Linked Edit
PR 21580 open chanke, 2020-07-21 09:26
Messages (3)
msg374054 - (view) Author: Christof Hanke (chanke) * Date: 2020-07-21 07:29
help(filecmp.cmp) says:

"""
cmp(f1, f2, shallow=True)
    Compare two files.
    
    Arguments:
    
    f1 -- First file name
    
    f2 -- Second file name
    
    shallow -- Just check stat signature (do not read the files).
               defaults to True.
    
    Return value:
    
    True if the files are the same, False otherwise.
    
    This function uses a cache for past comparisons and the results,
    with cache entries invalidated if their stat information
    changes.  The cache may be cleared by calling clear_cache().
"""

However, looking at the code, the shallow-argument is taken only into account if the signatures are the same:
"""
    s1 = _sig(os.stat(f1))
    s2 = _sig(os.stat(f2))
    if s1[0] != stat.S_IFREG or s2[0] != stat.S_IFREG:
        return False
    if shallow and s1 == s2:
        return True
    if s1[1] != s2[1]:
        return False

    outcome = _cache.get((f1, f2, s1, s2))
    if outcome is None:
        outcome = _do_cmp(f1, f2)
        if len(_cache) > 100:      # limit the maximum size of the cache
            clear_cache()
        _cache[f1, f2, s1, s2] = outcome
    return outcome
"""

Therefore, if I call cmp with shallow=True and the stat-signatures differ, 
cmp actually does a "deep" compare.
This "deep" compare however does not check the stat-signatures.

Thus I propose follwing patch:
cmp always checks the "full" signature.
return True if shallow and above test passed.
It does not make sense to me that when doing a "deep" compare, that only the size 
is compared, but not the mtime. 


--- filecmp.py.orig     2020-07-16 12:00:57.000000000 +0200
+++ filecmp.py  2020-07-16 12:00:30.000000000 +0200
@@ -52,10 +52,10 @@
     s2 = _sig(os.stat(f2))
     if s1[0] != stat.S_IFREG or s2[0] != stat.S_IFREG:
         return False
-    if shallow and s1 == s2:
-        return True
-    if s1[1] != s2[1]:
+    if s1 != s2:
         return False
+    if shallow:
+        return True
 
     outcome = _cache.get((f1, f2, s1, s2))
     if outcome is None:
msg383191 - (view) Author: Scott (FreeSandwiches) Date: 2020-12-16 17:46
I suggest changing the documentation rather than the code.  The mix up is in the wording.

Documentation below
"If shallow is true, files with identical os.stat() signatures are taken to be equal. Otherwise, the contents of the files are compared."

The "Otherwise" appears to be referencing the "If shallow is true" cause, when it should be referring to the equality of the _sig()s.

Proposed Change 
"If shallow is true, files with identical os.stat() signatures are taken to be equal. If they are not equal, the contents of the files are compared."
msg383885 - (view) Author: Christof Hanke (chanke) * Date: 2020-12-28 07:45
I understand that you are reluctant to change existing code.
But for me as a sysadmin, the current behavior doesn't make sense for two reasons:

* st.st_size is part of _sig.  why would you do a deep compare if  the two files have a different length ?

* comparing thousands of files, a proper shallow-only compare is required, since it takes a long time to compare large files (especially when they are migrated to a tape-backend), so a silent-fallback to a deep-compare is not good.
History
Date User Action Args
2020-12-28 07:45:44chankesetmessages: + msg383885
2020-12-16 17:46:24FreeSandwichessetnosy: + FreeSandwiches
messages: + msg383191
2020-07-21 09:26:27chankesetkeywords: + patch
stage: patch review
pull_requests: + pull_request20722
2020-07-21 07:29:03chankecreate