Issue4272
Created on 2008-11-06 20:46 by jfrechet, last changed 2009-01-04 21:39 by pitrou.
|
msg75580 |
Author: Jacques Frechet (jfrechet) |
Date: 2008-11-06 20:46 |
|
The gzip header defined in RFC 1952 includes a mandatory "MTIME" field,
originally intended to contain the modification time of the original
uncompressed file. It is often ignored when decompressing, though
gunzip (for example) uses it to set the modification time of the output
file if applicable.
The Python gzip module always sets the MTIME field to the current time,
and always discards MTIME when decompressing. As a result, compressing
the same string using gzip produces different output every time. For
certain applications, especially those involving comparisons or
cryptographic signing of binary files, these spurious changes can be
quite inconvenient. Aside from the MTIME field, the gzip module already
produces entirely deterministic output.
I'm attaching a patch which adds an optional "mtime" argument to the
GzipFile class, giving the caller the option of providing a timestamp
when compressing. Default behavior is unchanged. I've included updated
documentation and three new test cases in the patch.
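
A minimal sketch of how the proposed argument might be used, assuming the "mtime" keyword is accepted by GzipFile as described above. The helper name gzip_bytes is made up for illustration and is not part of the patch:

```python
import gzip
import io


def gzip_bytes(data, mtime=None):
    """Compress data in memory; a fixed mtime pins the MTIME header field."""
    buf = io.BytesIO()
    # No filename is passed, so MTIME is the only header field that can
    # differ between two otherwise identical runs.
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime) as f:
        f.write(data)
    return buf.getvalue()


payload = b"hello, gzip" * 100
first = gzip_bytes(payload, mtime=0)
second = gzip_bytes(payload, mtime=0)
print(first == second)
```

With the same zlib build and the same mtime, the two byte strings should compare equal; leaving mtime as None keeps the existing behavior of stamping the current time.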
In order to facilitate testing, the patch also includes code to set the
"mtime" member of the GzipFile instance when decompressing. The first
test case uses the new member to ensure that the timestamp given to the
GzipFile constructor is preserved correctly. The second test checks for
specific values in the entire gzip header (not just the MTIME field) by
reading the compressed file directly, examining individual fields in a
(relatively) flexible way. The third compares the entire compressed
stream against a predetermined sequence of bytes in a relatively
inflexible way. All tests pass on my AMD64 box, and I expect them all
to pass on all supported platforms without any problems. However, if
anybody is concerned that any of the tests sound like they might be too
brittle, I'm certainly not overly attached to them.
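
For readers who want to see roughly what the first two tests look at, here is a sketch (not the test code from the patch) that unpacks the fixed 10-byte header laid out in RFC 1952 and then reads the timestamp back through the "mtime" member. The variable names and the sample timestamp are arbitrary:

```python
import gzip
import io
import struct

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb", mtime=123456789) as f:
    f.write(b"example data")
raw = buf.getvalue()

# RFC 1952 fixed header: ID1, ID2, CM, FLG, MTIME (little-endian uint32),
# XFL, OS. The "<" prefix disables padding, so this is exactly 10 bytes.
id1, id2, method, flags, header_mtime, xfl, os_code = struct.unpack(
    "<BBBBIBB", raw[:10]
)

# Reading back through GzipFile: the mtime member is only populated once
# the header has been parsed, which happens on the first read.
with gzip.GzipFile(fileobj=io.BytesIO(raw), mode="rb") as f:
    data = f.read()
    read_mtime = f.mtime

print(hex(id1), hex(id2), method, header_mtime, read_mtime)
```

Checking individual fields this way does not depend on the bytes of the deflate stream, which is why it is less brittle than comparing the whole output.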
If anyone has any further suggestions, I'd be delighted to submit a new
patch.
Thanks!
Jacques
|
|
msg75581 |
Author: Jacques Frechet (jfrechet) |
Date: 2008-11-06 21:21 |
|
This discussion of the problem and possible workarounds might also be of
interest:
http://stackoverflow.com/questions/264224/setting-the-gzip-timestamp-from-python
|
|
msg75586 |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) |
Date: 2008-11-07 00:19 |
|
I considered using a datetime.datetime object instead. But it makes more
sense to use a time_t number, like os.stat() and time.time().
About the tests on the gzip format details: I am not an expert on the
gzip format, but are we sure that the compressed data will always be the
same?
Otherwise the patch is fine.
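
To illustrate the time_t convention mentioned above, a sketch that carries a source file's modification time from os.stat() into the gzip header. The temporary file and the chosen timestamp are only there to make the example self-contained:

```python
import gzip
import io
import os
import struct
import tempfile

# Create a throwaway source file with a known modification time.
with tempfile.NamedTemporaryFile(delete=False) as src:
    src.write(b"some file contents")
    src_path = src.name
os.utime(src_path, (1000000000, 1000000000))

# st_mtime is a float number of seconds since the epoch; truncate it to
# whole seconds because the gzip MTIME field is a 32-bit integer.
source_mtime = int(os.stat(src_path).st_mtime)

buf = io.BytesIO()
with open(src_path, "rb") as f_in:
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=source_mtime) as f_out:
        f_out.write(f_in.read())
os.remove(src_path)

# MTIME occupies bytes 4..7 of the header, little-endian.
stored_mtime = struct.unpack("<I", buf.getvalue()[4:8])[0]
print(stored_mtime == source_mtime)
```

A plain number passes straight from os.stat() or time.time() into the header, whereas a datetime object would need a conversion step first.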
|
|
msg75588 |
Author: Jacques Frechet (jfrechet) |
Date: 2008-11-07 01:26 |
|
I'm no expert either. The output certainly seems to be deterministic
for a given version of zlib, and I'm not aware of any prior versions of
zlib that produce different compressed output. However, my
understanding is that there is more than one possible compressed
representation of a given uncompressed input, so it's entirely possible
that a past or future version of zlib might produce compressed output
that is different while remaining interoperable. I have no idea whether
the zlib people care specifically about producing identical compressed
output across versions or not. It might be a big deal to them, or they
might have other priorities.
I included the third test because I am guessing that the compressed
output probably won't change very soon, and that if it does, it might be
interesting to know that it changed. If that sounds to you like it
might be more trouble than it's worth, then I think the right thing to
do would be to simply get rid of the third test and keep the first two.
|
|
msg78679 |
Author: Antoine Pitrou (pitrou) |
Date: 2009-01-01 02:12 |
|
test_literal_output looks really too strict to me. At most, you could
check that the header and trailer are unchanged, but it would probably
make it equivalent to test_metadata.
Other than that, I think it's a useful addition.
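
As a rough sketch of what checking only the header and trailer could look like (this is not code from the patch): the trailer holds the CRC32 and the uncompressed size, both of which follow from the input alone rather than from how zlib happens to encode it.

```python
import gzip
import io
import struct
import zlib

payload = b"trailer check" * 50
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb", mtime=0) as f:
    f.write(payload)
raw = buf.getvalue()

# The last eight bytes are CRC32 and ISIZE, both little-endian uint32
# (RFC 1952). Neither depends on the deflate stream in between.
crc, isize = struct.unpack("<II", raw[-8:])
expected_crc = zlib.crc32(payload) & 0xFFFFFFFF
expected_isize = len(payload) & 0xFFFFFFFF

print(crc == expected_crc, isize == expected_isize)
```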
|
|
msg78758 |
Author: Jacques Frechet (jfrechet) |
Date: 2009-01-02 05:12 |
|
I am uploading a new patch, identical to the previous patch except that
it does not contain the ill-advised third test case
(test_literal_output). The patch still applies cleanly and the tests
still pass.
|
|
msg79086 |
Author: Antoine Pitrou (pitrou) |
Date: 2009-01-04 21:39 |
|
The patches have been committed, thanks!
|
|
| Date | User | Action | Args |
| 2009-01-04 21:39:42 | pitrou | set | status: open -> closed; resolution: fixed; messages: + msg79086 |
| 2009-01-02 05:13:52 | jfrechet | set | files: + gzip-mtime-revised-2.x.patch |
| 2009-01-02 05:12:50 | jfrechet | set | files: + gzip-mtime-revised-py3k.patch; messages: + msg78758 |
| 2009-01-01 02:12:56 | pitrou | set | priority: normal; nosy: + pitrou; stage: patch review; messages: + msg78679; versions: + Python 3.1, Python 2.7 |
| 2008-11-07 01:26:34 | jfrechet | set | messages: + msg75588 |
| 2008-11-07 00:19:17 | amaury.forgeotdarc | set | nosy: + amaury.forgeotdarc; messages: + msg75586 |
| 2008-11-06 21:21:44 | jfrechet | set | messages: + msg75581 |
| 2008-11-06 20:49:09 | jfrechet | set | files: + gzip-mtime-2.x.patch |
| 2008-11-06 20:46:09 | jfrechet | create | |