classification
Title: filecmp.cmp documentation does not match actual code
Type: behavior Stage: resolved
Components: Library (Lib) Versions:
process
Status: closed Resolution: out of date
Dependencies: Superseder: filecmp.cmp(shallow=True) isn't actually shallow when only mtime differs
View: 42958
Assigned To: Nosy List: FreeSandwiches, andrei.avk, chanke, lukasz.langa
Priority: normal Keywords: patch

Created on 2020-07-21 07:29 by chanke, last changed 2021-08-04 18:51 by lukasz.langa. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 21580 closed chanke, 2020-07-21 09:26
Messages (6)
msg374054 - (view) Author: Christof Hanke (chanke) * Date: 2020-07-21 07:29
help(filecmp.cmp) says:

"""
cmp(f1, f2, shallow=True)
    Compare two files.
    
    Arguments:
    
    f1 -- First file name
    
    f2 -- Second file name
    
    shallow -- Just check stat signature (do not read the files).
               defaults to True.
    
    Return value:
    
    True if the files are the same, False otherwise.
    
    This function uses a cache for past comparisons and the results,
    with cache entries invalidated if their stat information
    changes.  The cache may be cleared by calling clear_cache().
"""

However, looking at the code, the shallow-argument is taken only into account if the signatures are the same:
"""
    s1 = _sig(os.stat(f1))
    s2 = _sig(os.stat(f2))
    if s1[0] != stat.S_IFREG or s2[0] != stat.S_IFREG:
        return False
    if shallow and s1 == s2:
        return True
    if s1[1] != s2[1]:
        return False

    outcome = _cache.get((f1, f2, s1, s2))
    if outcome is None:
        outcome = _do_cmp(f1, f2)
        if len(_cache) > 100:      # limit the maximum size of the cache
            clear_cache()
        _cache[f1, f2, s1, s2] = outcome
    return outcome
"""

Therefore, if I call cmp with shallow=True and the stat-signatures differ, 
cmp actually does a "deep" compare.
This "deep" compare however does not check the stat-signatures.

Thus I propose follwing patch:
cmp always checks the "full" signature.
return True if shallow and above test passed.
It does not make sense to me that when doing a "deep" compare, that only the size 
is compared, but not the mtime. 


--- filecmp.py.orig     2020-07-16 12:00:57.000000000 +0200
+++ filecmp.py  2020-07-16 12:00:30.000000000 +0200
@@ -52,10 +52,10 @@
     s2 = _sig(os.stat(f2))
     if s1[0] != stat.S_IFREG or s2[0] != stat.S_IFREG:
         return False
-    if shallow and s1 == s2:
-        return True
-    if s1[1] != s2[1]:
+    if s1 != s2:
         return False
+    if shallow:
+        return True
 
     outcome = _cache.get((f1, f2, s1, s2))
     if outcome is None:
msg383191 - (view) Author: Scott (FreeSandwiches) Date: 2020-12-16 17:46
I suggest changing the documentation rather than the code.  The mix up is in the wording.

Documentation below
"If shallow is true, files with identical os.stat() signatures are taken to be equal. Otherwise, the contents of the files are compared."

The "Otherwise" appears to be referencing the "If shallow is true" cause, when it should be referring to the equality of the _sig()s.

Proposed Change 
"If shallow is true, files with identical os.stat() signatures are taken to be equal. If they are not equal, the contents of the files are compared."
msg383885 - (view) Author: Christof Hanke (chanke) * Date: 2020-12-28 07:45
I understand that you are reluctant to change existing code.
But for me as a sysadmin, the current behavior doesn't make sense for two reasons:

* st.st_size is part of _sig.  why would you do a deep compare if  the two files have a different length ?

* comparing thousands of files, a proper shallow-only compare is required, since it takes a long time to compare large files (especially when they are migrated to a tape-backend), so a silent-fallback to a deep-compare is not good.
msg397272 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-07-12 01:31
Christof: I've left one comment on the PR.

Generally your reasoning (on why unexpected deep cmp is not ideal), makes sense to me.

However, this is a backwards incompatible change that will probably affect a large number of a lot of scripts that quietly run in the background and might silently break (i.e. just by affecting different sets of files, with no reported errors or warnings) with this change.

Can you find other instances where people complained about this behavior e.g. here on BPO or SO?

What do you think about making this change while preserving backwards compatibility e.g. via a new arg?
msg397318 - (view) Author: Christof Hanke (chanke) * Date: 2021-07-12 15:04
Andrei,

See https://bugs.python.org/issue42958

Someone else stumbled over this topic.

Maybe you can merge these two requests?

Otherwise, I'm fine with a new arg.

Christof
msg398933 - (view) Author: Ɓukasz Langa (lukasz.langa) * (Python committer) Date: 2021-08-04 18:51
Closed in favor of BPO-42958. Thanks for your report, Christof!
History
Date User Action Args
2021-08-04 18:51:38lukasz.langasetstatus: open -> closed

superseder: filecmp.cmp(shallow=True) isn't actually shallow when only mtime differs

nosy: + lukasz.langa
messages: + msg398933
resolution: out of date
stage: patch review -> resolved
2021-07-12 15:04:28chankesetmessages: + msg397318
2021-07-12 01:31:37andrei.avksetnosy: + andrei.avk
messages: + msg397272
2020-12-28 07:45:44chankesetmessages: + msg383885
2020-12-16 17:46:24FreeSandwichessetnosy: + FreeSandwiches
messages: + msg383191
2020-07-21 09:26:27chankesetkeywords: + patch
stage: patch review
pull_requests: + pull_request20722
2020-07-21 07:29:03chankecreate