classification
Title: set timestamp in gzip stream
Type: enhancement
Stage: patch review
Components: Library (Lib)
Versions: Python 3.1, Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: amaury.forgeotdarc, jfrechet, pitrou
Priority: normal
Keywords: patch

Created on 2008-11-06 20:46 by jfrechet, last changed 2009-01-04 21:39 by pitrou. This issue is now closed.

Files
File name Uploaded Description
gzip-mtime-py3k.patch jfrechet, 2008-11-06 20:46 gzip mtime patch (vs branches/py3k)
gzip-mtime-2.x.patch jfrechet, 2008-11-06 20:49 gzip mtime patch (vs 2.x trunk)
gzip-mtime-revised-py3k.patch jfrechet, 2009-01-02 05:12 same patch without test_literal_output [py3k]
gzip-mtime-revised-2.x.patch jfrechet, 2009-01-02 05:13 same patch without test_literal_output [2.x trunk]
Messages (7)
msg75580 - Author: Jacques Frechet (jfrechet) Date: 2008-11-06 20:46
The gzip header defined in RFC 1952 includes a mandatory "MTIME" field,
originally intended to contain the modification time of the original
uncompressed file.  It is often ignored when decompressing, though
gunzip (for example) uses it to set the modification time of the output
file if applicable.
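
For illustration, here is a minimal sketch of reading MTIME straight out of the fixed 10-byte header described in RFC 1952. The helper name read_gzip_mtime is hypothetical and not part of the gzip module. The stream is written without any explicit timestamp, so the field should hold roughly the current time.

```python
import gzip
import io
import struct
import time

def read_gzip_mtime(blob):
    """Return the MTIME field (seconds since the epoch) of a gzip member.

    Hypothetical helper for illustration; not part of the gzip module.
    """
    # RFC 1952 fixed header: ID1, ID2, CM, FLG, MTIME (4 bytes,
    # little-endian), XFL, OS. That is 10 bytes in total.
    id1, id2, cm, flg, mtime, xfl, os_code = struct.unpack("<BBBBIBB", blob[:10])
    if (id1, id2) != (0x1F, 0x8B):
        raise ValueError("not a gzip stream")
    return mtime

# Compress without giving a timestamp, so the module picks the current time.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(b"hello")

stamp = read_gzip_mtime(buf.getvalue())
now = int(time.time())
```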

The Python gzip module always sets the MTIME field to the current time,
and always discards MTIME when decompressing.  As a result, compressing
the same string using gzip produces different output every time.  For
certain applications, especially those involving comparisons or
cryptographic signing of binary files, these spurious changes can be
quite inconvenient.  Aside from the MTIME field, the gzip module already
produces entirely deterministic output.

I'm attaching a patch which adds an optional "mtime" argument to the
GzipFile class, giving the caller the option of providing a timestamp
when compressing.  Default behavior is unchanged.  I've included updated
documentation and three new test cases in the patch.
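
A short sketch of how the proposed argument might be used, assuming a Python version in which GzipFile accepts mtime. The helper gzip_bytes and the timestamp value are made up for the example.

```python
import gzip
import io

def gzip_bytes(data, mtime=None):
    """Compress data into an in-memory gzip stream (illustrative helper)."""
    buf = io.BytesIO()
    # mtime=None keeps the default behavior (current time);
    # an explicit number is written to the MTIME header field instead.
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime) as f:
        f.write(data)
    return buf.getvalue()

FIXED_STAMP = 1226004360  # arbitrary example value, seconds since the epoch

first = gzip_bytes(b"same input", mtime=FIXED_STAMP)
second = gzip_bytes(b"same input", mtime=FIXED_STAMP)

# With the timestamp pinned, the two streams are byte-for-byte identical.
assert first == second
```

With the timestamp pinned, repeated runs in the same environment produce the same bytes, which is what comparison and signing workflows need.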

In order to facilitate testing, the patch also includes code to set the
"mtime" member of the GzipFile instance when decompressing.  The first
test case uses the new member to ensure that the timestamp given to the
GzipFile constructor is preserved correctly.  The second test checks for
specific values in the entire gzip header (not just the MTIME field) by
reading the compressed file directly, examining individual fields in a
(relatively) flexible way.  The third compares the entire compressed
stream against a predetermined sequence of bytes in a relatively
inflexible way.  All tests pass on my AMD64 box, and I expect them all
to pass on all supported platforms without any problems.  However, if
anybody is concerned that any of the tests sound like they might be too
brittle, I'm certainly not overly attached to them.
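
The round trip checked by the first test can be sketched roughly as follows. This is not the code from the patch, only an illustration that assumes the mtime attribute is populated once the header has been read. The timestamp value is arbitrary.

```python
import gzip
import io

stamp = 123456789  # arbitrary example timestamp

# Write a stream with an explicit timestamp.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb", mtime=stamp) as f:
    f.write(b"payload")

# Read it back; reading forces the header to be parsed, after which
# the timestamp from the header is available on the instance.
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as f:
    data = f.read()
    recovered = f.mtime

assert data == b"payload"
assert recovered == stamp
```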

If anyone has any further suggestions, I'd be delighted to submit a new
patch.

Thanks!

Jacques
msg75581 - Author: Jacques Frechet (jfrechet) Date: 2008-11-06 21:21
This discussion of the problem and possible workarounds might also be of
interest:

 
http://stackoverflow.com/questions/264224/setting-the-gzip-timestamp-from-python
msg75586 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-11-07 00:19
I considered using a datetime.datetime object instead, but it makes more
sense to use a time_t number, like os.stat() and time.time().
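
A caller who starts from a datetime can still convert it to a number first. The sketch below assumes the naive datetime is meant as UTC and that GzipFile accepts the mtime argument. The date chosen is only an example.

```python
import calendar
import datetime
import gzip
import io

# Naive datetime, interpreted here as UTC (an assumption of this example).
when = datetime.datetime(2008, 11, 6, 20, 46, 0)

# calendar.timegm() turns a UTC time tuple into seconds since the epoch,
# the same kind of number that os.stat() and time.time() return.
stamp = calendar.timegm(when.timetuple())

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb", mtime=stamp) as f:
    f.write(b"data")

# The four MTIME bytes follow the magic number, method and flags.
header_mtime = int.from_bytes(buf.getvalue()[4:8], "little")
assert header_mtime == stamp
```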

About the tests on the gzip format details: I am not an expert on the
gzip format, but are we sure that the compressed data will always be the
same?
Otherwise the patch is fine.
msg75588 - Author: Jacques Frechet (jfrechet) Date: 2008-11-07 01:26
I'm no expert either.  The output certainly seems to be deterministic
for a given version of zlib, and I'm not aware of any prior versions of
zlib that produce different compressed output.  However, my
understanding is that there is more than one possible compressed
representation of a given uncompressed input, so it's entirely possible
that a past or future version of zlib might produce compressed output
that is different while remaining interoperable.  I have no idea whether
the zlib people care specifically about producing identical compressed
output across versions or not.  It might be a big deal to them, or they
might have other priorities.
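
The point that one input can have several valid compressed forms is easy to demonstrate even within a single zlib version by varying the compression level. This is only an illustration of the general property and says nothing about what different zlib releases actually do.

```python
import zlib

data = b"abcabcabc" * 100

stored = zlib.compress(data, 0)  # level 0: data is stored without compression
packed = zlib.compress(data, 9)  # level 9: maximum compression effort

# The two byte strings differ...
assert stored != packed
assert len(packed) < len(stored)

# ...yet both decompress to exactly the same input.
assert zlib.decompress(stored) == data
assert zlib.decompress(packed) == data
```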

I included the third test because I am guessing that the compressed
output probably won't change very soon, and that if it does, it might be
interesting to know that it changed.  If that sounds to you like it
might be more trouble than it's worth, then I think the right thing to
do would be to simply get rid of the third test and keep the first two.
msg78679 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-01 02:12
test_literal_output looks really too strict to me. At most, you could
check that the header and trailer are unchanged, but it would probably
make it equivalent to test_metadata.
Other than that, I think it's a useful addition.
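
As a rough sketch of what a header-and-trailer-only check could look like (this is not the test from the patch), the following verifies the fields that RFC 1952 fixes regardless of how zlib encodes the body. It assumes GzipFile accepts the mtime argument.

```python
import gzip
import io
import struct
import zlib

data = b"payload to compress"

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb", mtime=1) as f:
    f.write(data)
blob = buf.getvalue()

# Header: two magic bytes followed by the compression method (8 = deflate).
magic, method = blob[:2], blob[2]

# Trailer: CRC-32 of the uncompressed data, then its length modulo 2**32,
# both stored as little-endian 32-bit integers.
crc32, isize = struct.unpack("<II", blob[-8:])

assert magic == b"\x1f\x8b"
assert method == 8
assert crc32 == zlib.crc32(data) & 0xFFFFFFFF
assert isize == len(data) % 2**32
```

None of these assertions depend on the deflate body, so they should hold even if a different zlib version encodes the body differently.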
msg78758 - Author: Jacques Frechet (jfrechet) Date: 2009-01-02 05:12
I am uploading a new patch, identical to the previous patch except that
it does not contain the ill-advised third test case
(test_literal_output).  The patch still applies cleanly and the tests
still pass.
msg79086 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-04 21:39
The patches have been committed, thanks!
History
Date                 User                Action  Args
2009-01-04 21:39:42  pitrou              set     status: open -> closed
                                                 resolution: fixed
                                                 messages: + msg79086
2009-01-02 05:13:52  jfrechet            set     files: + gzip-mtime-revised-2.x.patch
2009-01-02 05:12:50  jfrechet            set     files: + gzip-mtime-revised-py3k.patch
                                                 messages: + msg78758
2009-01-01 02:12:56  pitrou              set     priority: normal
                                                 nosy: + pitrou
                                                 stage: patch review
                                                 messages: + msg78679
                                                 versions: + Python 3.1, Python 2.7
2008-11-07 01:26:34  jfrechet            set     messages: + msg75588
2008-11-07 00:19:17  amaury.forgeotdarc  set     nosy: + amaury.forgeotdarc
                                                 messages: + msg75586
2008-11-06 21:21:44  jfrechet            set     messages: + msg75581
2008-11-06 20:49:09  jfrechet            set     files: + gzip-mtime-2.x.patch
2008-11-06 20:46:09  jfrechet            create