Author vstinner
Recipients SilentGhost, christian.heimes, gregory.p.smith, martin.panter, palaviv, rhettinger, terry.reedy, vstinner
Date 2016-03-29.21:30:46
Message-id <1459287047.03.0.152024322166.issue26488@psf.upfronthosting.co.za>
In-reply-to
Content
Regarding compatibility with existing tools, I recall a discussion from when the tarfile module got a CLI. At first I expected a clone of the UNIX tar command, but it was decided to design a new, *simpler* CLI:

---------------------------------------------------
$ python3 -m tarfile
usage: tarfile.py [-h] [-v] [-l <tarfile> | -e <tarfile> [<output_dir> ...] |
                  -c <name> [<file> ...] | -t <tarfile>]

A simple command line interface for tarfile module.

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Verbose output
  -l <tarfile>, --list <tarfile>
                        Show listing of a tarfile
  -e <tarfile> [<output_dir> ...], --extract <tarfile> [<output_dir> ...]
                        Extract tarfile into target dir
  -c <name> [<file> ...], --create <name> [<file> ...]
                        Create tarfile from sources
  -t <tarfile>, --test <tarfile>
                        Test if a tarfile is valid
---------------------------------------------------


A common trap with the md5sum CLI is that users write "echo string|md5sum", which hashes the string plus a trailing newline. For some reason, my French manual page for md5sum documents a -s STRING/--string=STRING argument, but the md5sum program actually installed on my system does not support it. Maybe we should consider adding such an option to avoid the trap?
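
The newline trap is easy to reproduce with hashlib. A minimal illustration (the input value is arbitrary, and nothing here is part of the proposed CLI):

```python
import hashlib

# "echo string | md5sum" hashes b"string\n", not b"string".
without_newline = hashlib.md5(b"string").hexdigest()
with_newline = hashlib.md5(b"string\n").hexdigest()

print(without_newline)
print(with_newline)
print(without_newline == with_newline)  # False: the two digests differ
```

An option taking the string directly would hash exactly the bytes given, like the first call.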


Do you want to implement a function that compares the computed hash with a file containing the expected hash? This is how md5sum checks file integrity: md5sum -c FILE/--check=FILE. Example:
------
$ md5sum test_socket_with.patch > check
$ cat check 
cfc1d69e76c827c32af4f28f50714a5e  test_socket_with.patch

$ md5sum -c check
test_socket_with.patch: OK

$ vim test_socket_with.patch 
<modify something in the file>

$ md5sum -c check
test_socket_with.patch: FAILED
md5sum: WARNING: 1 computed checksum did NOT match
------
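
A rough sketch of what such a check could look like in Python. The function name check_file and its parameters are hypothetical, and it only handles the "<hexdigest>  <filename>" line format shown above (two spaces, no binary-mode "*" marker):

```python
import hashlib

def check_file(check_path, algorithm="md5"):
    """Verify each "<hexdigest>  <filename>" line; return the number of failures."""
    failures = 0
    with open(check_path, encoding="utf-8") as check:
        for line in check:
            line = line.rstrip("\n")
            if not line:
                continue
            # The digest contains no spaces, so the first double space
            # separates it from the file name.
            expected, _, name = line.partition("  ")
            h = hashlib.new(algorithm)
            with open(name, "rb") as f:
                for chunk in iter(lambda: f.read(8192), b""):
                    h.update(chunk)
            ok = h.hexdigest() == expected.lower()
            print(f"{name}: {'OK' if ok else 'FAILED'}")
            failures += not ok
    return failures
```

A caller could turn a non-zero return value into a warning and a non-zero exit status, as md5sum does.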


I worked hard to release the GIL while a hash is computed. It would be super cool (a killer feature?) to automatically spawn threads to compute the hashes. For example, use N threads where N is the number of CPUs (os.cpu_count() or 1). The last time I wrote my own md5sum.py, it was much faster than the UNIX md5sum tool because it used all my CPU cores. You just need to ensure that the output is written in the correct order.
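
One possible way to do this, sketched with concurrent.futures. The helper names hash_file and hash_files and the 1 MiB chunk size are assumptions for illustration, not the md5sum.py mentioned above:

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

def hash_file(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of one file, reading it in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            # hashlib releases the GIL while updating with large buffers,
            # so several files can be hashed in parallel threads.
            h.update(chunk)
    return h.hexdigest()

def hash_files(paths, algorithm="md5"):
    """Hash files in a thread pool; return (path, digest) pairs in input order."""
    paths = list(paths)
    workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the order of its input, regardless of
        # which thread finishes first, so the output order stays correct.
        digests = list(pool.map(lambda p: hash_file(p, algorithm), paths))
    return list(zip(paths, digests))
```

Printing the pairs returned by hash_files() in a plain loop gives md5sum-style output in the same order as the arguments.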


Raymond wrote:
> 1) Neither the md5 or shasum command-line tools offer control over the blocksize.  I suggest that option be dropped from the command-line API giving a nice simplification and usability improvement.

I agree. The block size should be computed per file using os.stat().st_blksize:

   https://docs.python.org/dev/library/os.html#os.stat_result.st_blksize

The io module uses st_blksize if it is greater than 1, and falls back to 8 * 1024 bytes otherwise.
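
The same rule could be applied per file roughly like this. The names block_size_for and FALLBACK_BLOCK_SIZE are hypothetical; getattr() is used because st_blksize is not available on every platform:

```python
import os

FALLBACK_BLOCK_SIZE = 8 * 1024  # same fallback value as described above

def block_size_for(path):
    """Pick a read size for path: st_blksize if usable, else the fallback."""
    st = os.stat(path)
    blksize = getattr(st, "st_blksize", 0)  # attribute may be missing
    return blksize if blksize > 1 else FALLBACK_BLOCK_SIZE
```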

(By the way, it looks like shutil.copyfile() doesn't use st_blksize.)