Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filecmp.cmp(shallow=True) isn't actually shallow when only mtime differs #87124

Closed
AlexVndnblcke mannequin opened this issue Jan 18, 2021 · 14 comments
Closed

filecmp.cmp(shallow=True) isn't actually shallow when only mtime differs #87124

AlexVndnblcke mannequin opened this issue Jan 18, 2021 · 14 comments
Labels
3.9 only security fixes 3.10 only security fixes 3.11 only security fixes docs Documentation in the Doc dir type-bug An unexpected behavior, bug, or error

Comments

@AlexVndnblcke
Copy link
Mannequin

AlexVndnblcke mannequin commented Jan 18, 2021

BPO 42958
Nosy @ambv, @MojoVampire, @miss-islington, @chanke-mpcdf, @akulakov, @AlexVndnblcke
PRs
  • bpo-42958: Align docstring and filecmp.cmp() for shallow compare #24246
  • bpo-42958: Update filecmp.cmp docs #27166
  • [3.10] bpo-42958: Improve description of shallow= in filecmp.cmp docs (GH-27166) #27607
  • [3.9] bpo-42958: Improve description of shallow= in filecmp.cmp docs (GH-27166) #27608
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2021-08-04.20:11:39.492>
    created_at = <Date 2021-01-18.20:24:48.561>
    labels = ['3.11', 'type-bug', '3.9', '3.10', 'docs']
    title = "filecmp.cmp(shallow=True) isn't actually shallow when only mtime differs"
    updated_at = <Date 2021-08-04.20:11:39.491>
    user = 'https://github.com/AlexVndnblcke'

    bugs.python.org fields:

    activity = <Date 2021-08-04.20:11:39.491>
    actor = 'lukasz.langa'
    assignee = 'docs@python'
    closed = True
    closed_date = <Date 2021-08-04.20:11:39.492>
    closer = 'lukasz.langa'
    components = ['Documentation']
    creation = <Date 2021-01-18.20:24:48.561>
    creator = 'AlexVndnblcke'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 42958
    keywords = ['patch']
    message_count = 14.0
    messages = ['385225', '385304', '397320', '397340', '397356', '397363', '397364', '397381', '397560', '397561', '398944', '398949', '398950', '398951']
    nosy_count = 7.0
    nosy_names = ['docs@python', 'lukasz.langa', 'josh.r', 'miss-islington', 'chanke', 'andrei.avk', 'AlexVndnblcke']
    pr_nums = ['24246', '27166', '27607', '27608']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue42958'
    versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']

    @AlexVndnblcke
    Copy link
    Mannequin Author

    AlexVndnblcke mannequin commented Jan 18, 2021

    Consider the case where 2 files are shallow compared where only the mtime differs (i.e. the mode and size is identical).

    With filecmp.cmp(f1, f2, shallow=True) a deep compare would be performed behind the scenes since the guard clauses fell through. This discrepancy is either a problem in the docstring or a problem in the comparison itself.

    Two options remain:

    • the documentation is altered and describes that, in case only the mtime differs (i.e. mode and size are equal) a deep compare is performed
    • the behaviour is restricted in that effectively only a shallow compare is performed
    • a shallow compare becomes more restrictive (do not consider the mtime during the compare)

    My preference is to adjust the function to safeguard the meaning of 'shallow' in this context.

    @AlexVndnblcke AlexVndnblcke mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jan 18, 2021
    @MojoVampire
    Copy link
    Mannequin

    MojoVampire mannequin commented Jan 19, 2021

    This is a problem with the docstring. The actual docs for it are a bit more clear, https://docs.python.org/3/library/filecmp.html#filecmp.cmp :

    "If shallow is true, files with identical os.stat() signatures are taken to be equal. Otherwise, the contents of the files are compared."

    Your patch can't be used because it changes longstanding documented behavior. If you'd like to submit a patch to fix the docstring, that's fine, but we're not going to break existing code to make the function less accurate.

    The patch should probably make the documentation more clear while it's at it.

    1. The original wording could be misinterpreted as having the "Otherwise" apply to shallow=False only, not to the case where shallow=T rue but os.stat doesn't match.
    2. The existing wording isn't clear on what an os.stat() "signature" is, which can leave the impression that the entirety of os.stat is compared (which would only match for hardlinks of the same file), when in fact it's just the file type (stat.S_IFMT(st.st_mode), file vs. directory vs. symlink, etc.), size and mtime.

    Proposed rewording of main docs would be:

    "If shallow is true, files with identical os.stat() signatures (file type, size, and modification time) are taken to be equal. When shallow is false, or the file signatures are identical, the contents of the files are compared."

    A similar wording (or at least, a shorter version of the above, rather than a strictly wrong description of the shallow parameter) could be applied to the docstring.

    @chanke-mpcdf
    Copy link
    Mannequin

    chanke-mpcdf mannequin commented Jul 12, 2021

    Hi,

    this is also discussed in https://bugs.python.org/issue41354.

    Ciao,
    Christof

    @akulakov
    Copy link
    Contributor

    Christof and Alexander, as you both said your preference is to have some way of enforcing "only shallow" compare, I want to clarify -- it seems if you do that, you will get the answer of True if the files are most likely the same (sig is equal), but if the sig is not equal, I'm not sure what answer you expect or why. The sig may be different because of mtime, and then the files may be different or the same, it's anyone's guess.

    I wonder if both of you expect the same behavior, and if so, for the same use case or not?

    @akulakov
    Copy link
    Contributor

    Also see this 16 years old issue: https://bugs.python.org/issue1234674

    @chanke-mpcdf
    Copy link
    Mannequin

    chanke-mpcdf mannequin commented Jul 12, 2021

    Hi Andrei,

    I would follow rsync.
    From the man page:
    """
    [...]
    -c, --checksum
    This changes the way rsync checks if the files have been changed and are in need of a transfer. Without this option, rsync
    uses a "quick check" that (by default) checks if each file’s size and time of last modification match between the sender and
    receiver.
    [...]
    """

    so, yes you can have false positives with a shallow comparison of size + mtime only. But that's usually ok for e.g. incremental backups.

    Wow, the bug is that old...

    @akulakov
    Copy link
    Contributor

    But rsync is a quite more specific tool. For example, unix cmp command does not guess based on any part of stat sig. That's a much closer command in scope to 'filecmp'.

    @chanke-mpcdf
    Copy link
    Mannequin

    chanke-mpcdf mannequin commented Jul 13, 2021

    Andrei,
    cmp is the deep-compare part of filecmp. I thought we were taking about the shallow one.

    Thus,

    • shallow like rsync "quick": size + mtime.
    • deep like cmp

    @akulakov
    Copy link
    Contributor

    I've put up the doc update PR here:
    #27166

    I've also reviewed a few dozen SO questions about filecmp.cmp and shallow arg, many of them find the docs confusing or lacking, so I thought updating docs is definitely very useful.

    I haven't seen any of them asking for "force shallow" behavior, but there's many of them and I haven't read all; searching for "force shallow" doesn't find any.

    I think it might be useful to also add a new 'force shallow' arg, but I hope more people ask for it or compile a list of SO questions asking for it, otherwise it might not be worth additional complexity.

    @akulakov
    Copy link
    Contributor

    Alexander: sorry, I didn't notice your PR because I was first working on a different issue, and then found this as a duplicate, and only noticed there's already an open PR here. If you prefer, we can close my PR and work on updating yours.

    @ambv
    Copy link
    Contributor

    ambv commented Aug 4, 2021

    New changeset a8dc489 by andrei kulakov in branch 'main':
    bpo-42958: Improve description of shallow= in filecmp.cmp docs (GH-27166)
    a8dc489

    @miss-islington
    Copy link
    Contributor

    New changeset c2593b4 by Miss Islington (bot) in branch '3.10':
    bpo-42958: Improve description of shallow= in filecmp.cmp docs (GH-27166)
    c2593b4

    @ambv
    Copy link
    Contributor

    ambv commented Aug 4, 2021

    New changeset 7dad033 by Miss Islington (bot) in branch '3.9':
    bpo-42958: Improve description of shallow= in filecmp.cmp docs (GH-27166) (GH-27608)
    7dad033

    @ambv
    Copy link
    Contributor

    ambv commented Aug 4, 2021

    Thanks all for your effort on improving this! ✨ 🍰 ✨

    @ambv ambv added 3.9 only security fixes 3.10 only security fixes 3.11 only security fixes docs Documentation in the Doc dir and removed stdlib Python modules in the Lib dir labels Aug 4, 2021
    @ambv ambv closed this as completed Aug 4, 2021
    @ambv ambv added 3.9 only security fixes 3.10 only security fixes 3.11 only security fixes docs Documentation in the Doc dir and removed stdlib Python modules in the Lib dir labels Aug 4, 2021
    @ambv ambv closed this as completed Aug 4, 2021
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes 3.10 only security fixes 3.11 only security fixes docs Documentation in the Doc dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants