Author Dean Morin
Recipients Dean Morin
Date 2018-06-18.18:18:51
Message-id <1529345931.78.0.56676864532.issue33896@psf.upfronthosting.co.za>
In-reply-to
Content
By default, `filecmp.cmp()` uses `shallow=True`, which can produce surprising behavior.

In the docs it states:

> If shallow is true, files with identical os.stat() signatures are taken to be equal.

However, the "signature" only considers the file type (taken from `st_mode`), the file size, and the file modification time, which is not sufficient to establish equality. `cmp()` will return `True` (files are equal) in some circumstances for files that actually differ. Depending on the underlying file system, the same Python script will return `True` or `False` when `cmp()` is called on the exact same files. I'll add the long-winded details at the bottom.
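
The behavior can be reproduced without depending on any particular file system. The following is a minimal sketch, assuming a temporary directory and an arbitrary whole-second timestamp; the file names and the `make_file` helper are made up for illustration. It writes two files of equal size but different content, forces the same modification time on both with `os.utime()`, and compares them both ways:

```python
import filecmp
import os
import tempfile


def make_file(directory, name, content, mtime):
    """Write `content` to a new file and force its atime/mtime (hypothetical helper)."""
    path = os.path.join(directory, name)
    with open(path, "w") as f:
        f.write(content)
    os.utime(path, (mtime, mtime))
    return path


with tempfile.TemporaryDirectory() as tmp:
    # Same length, different dates, identical whole-second mtime.
    old = make_file(tmp, "old.csv", "id,date\n1,2018-06-15\n", 1529108360)
    new = make_file(tmp, "new.csv", "id,date\n1,2018-06-16\n", 1529108360)

    shallow_equal = filecmp.cmp(old, new)               # shallow=True is the default
    deep_equal = filecmp.cmp(old, new, shallow=False)   # compares file contents

print(shallow_equal, deep_equal)
```

Because the two signatures match, the shallow comparison reports the files as equal, while `shallow=False` reads the contents and reports a difference.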

To fix, I believe `st.st_ino` should be included in `_sig` (https://github.com/python/cpython/blob/3.7/Lib/filecmp.py#L68).
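
As a rough sketch of what the change would do, assuming the current signature is built from the file type, size, and modification time as described above: `sig` below is modeled on that description rather than copied from the standard library, and `sig_with_inode` is a hypothetical variant that appends `st_ino`. The file names and timestamp are placeholders.

```python
import os
import stat
import tempfile


def sig(st):
    # Modeled on the description above: file type, size, mtime.
    return (stat.S_IFMT(st.st_mode), st.st_size, st.st_mtime)


def sig_with_inode(st):
    # Hypothetical variant: the same fields plus the inode number.
    return sig(st) + (st.st_ino,)


with tempfile.TemporaryDirectory() as tmp:
    stats = []
    for name, content in (("a.csv", "2018-06-15\n"), ("b.csv", "2018-06-16\n")):
        path = os.path.join(tmp, name)
        with open(path, "w") as f:
            f.write(content)
        # Force identical whole-second timestamps on both files.
        os.utime(path, (1529108360, 1529108360))
        stats.append(os.stat(path))

same_sig = sig(stats[0]) == sig(stats[1])
same_sig_with_inode = sig_with_inode(stats[0]) == sig_with_inode(stats[1])
print(same_sig, same_sig_with_inode)
```

One consequence worth weighing: two separate files on the same file system do not share an inode number, so with this variant the shallow shortcut would, as far as I can tell, only return `True` for the same file or for hard links to it. Everything else would fall through to the content comparison.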

I'm in the middle of a move, but I can make a PR in the next couple of weeks if this seems like a reasonable fix and no one else gets around to it.

The long version is that we're migrating some existing reports to a new data source. The goal is to produce identical CSV files from both data sources. I have a Python script that pulls down both CSV files and uses `cmp()` to compare them.

On my machine, the script correctly discovers the differences between the two. One of the date columns has incorrect dates in the new version.

However, on my colleague's machine, the script fails to discover the differences and reports that the CSV files are identical.

The difference is that on my machine, `os.stat(f).st_mtime` is a timestamp which includes fractional seconds (1529108360.1955538), but on my colleague's machine it only includes whole seconds (1529108360.0). Since only the dates differed within the CSV files, both files had the same file mode and file size, and both were downloaded within the same second.

We got a few more people to check what they got for `st_mtime`. The cause appears to be the file system in use. We're all using Macs, but for those of us using an APFS Volume disk, `st_mtime` returns a timestamp which includes fractional seconds, and for those of us using a Logical Volume Mac OS Extended disk, it returns a timestamp which only includes whole seconds (1529108360.0).
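
For anyone who wants to check their own disk, here is a small sketch that prints the modification time of a freshly created file. The `mtime_fraction_ns` helper is a name I made up; it uses `st_mtime_ns` to avoid float rounding.

```python
import os
import tempfile


def mtime_fraction_ns(path):
    """Return the sub-second part of the file's mtime, in nanoseconds."""
    return os.stat(path).st_mtime_ns % 1_000_000_000


# Create a throwaway file on the file system that holds the temp directory.
with tempfile.NamedTemporaryFile(delete=False) as f:
    probe = f.name

try:
    st = os.stat(probe)
    fraction = mtime_fraction_ns(probe)
    print("st_mtime:   ", st.st_mtime)
    print("st_mtime_ns:", st.st_mtime_ns)
    print("fraction ns:", fraction)
finally:
    os.remove(probe)
```

A single zero could be a coincidence, but a fraction that is 0 for every freshly written file suggests the file system only stores whole seconds. Note that this probes the file system behind the temporary directory, which may not be the one holding the files being compared.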

When comparing the `os.stat()` results for the two differing CSV files, the only difference (other than fractional seconds in the various timestamps) was `st_ino`, which is why I believe it should be included in `_sig()`.
History
Date                 User        Action  Args
2018-06-18 18:18:51  Dean Morin  set     recipients: + Dean Morin
2018-06-18 18:18:51  Dean Morin  set     messageid: <1529345931.78.0.56676864532.issue33896@psf.upfronthosting.co.za>
2018-06-18 18:18:51  Dean Morin  link    issue33896 messages
2018-06-18 18:18:51  Dean Morin  create