This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Title: Work with an extra field of gzip and zip files
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.8
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Benjamin.Sergeant, Jason Williams, amijalis, dmi.baranov, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2013-04-09 15:03 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin.

File name Uploaded Description Edit
gzip_extra.diff serhiy.storchaka, 2013-11-16 19:18 review
zipfile_extra.diff serhiy.storchaka, 2013-11-16 19:19 review serhiy.storchaka, 2013-11-16 19:19 serhiy.storchaka, 2013-11-16 19:20
Messages (8)
msg186423 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-04-09 15:03
Gzip files can contain an extra field, and some applications use it to extend the gzip format. The current GzipFile implementation ignores this field on input and does not allow creating a new file with an extra field.

I propose saving the extra field data as a GzipFile attribute on reading, and adding a new parameter to the GzipFile constructor for creating a new file with an extra field.
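To illustrate the gap described above, here is a minimal sketch that pulls the raw FEXTRA bytes out of a gzip header by hand, following the RFC 1952 layout. The helper name read_gzip_extra and the sample bytes are illustrative assumptions and are not part of the attached patches; the sample is a hand-assembled member with an empty payload.

```python
import gzip
import struct

FEXTRA = 0x04  # FLG bit 2 in RFC 1952


def read_gzip_extra(data):
    """Return the raw FEXTRA payload of the first gzip member, or None.

    Hypothetical helper for illustration; not an API of the gzip module.
    """
    if len(data) < 10:
        raise ValueError("truncated gzip header")
    # Fixed 10-byte header: magic (2), method (1), flags (1), then
    # mtime (4), XFL (1) and OS (1), which are not needed here.
    magic, method, flags = struct.unpack_from("<HBB", data, 0)
    if magic != 0x8B1F or method != 8:
        raise ValueError("not a deflate-compressed gzip member")
    if not flags & FEXTRA:
        return None
    # XLEN is a little-endian uint16 right after the fixed header.
    (xlen,) = struct.unpack_from("<H", data, 10)
    extra = data[12:12 + xlen]
    if len(extra) != xlen:
        raise ValueError("truncated extra field")
    return extra


# Hand-assembled gzip member: empty payload, one 6-byte extra field.
SAMPLE = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 424302001b00 0300 00000000 00000000"
)

# The standard library decompresses the member but does not expose the field.
print(gzip.decompress(SAMPLE))
print(read_gzip_extra(SAMPLE))
```

The sketch only looks at the first member of a stream; a multi-member file would need the same check at the start of each member.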
msg190295 - (view) Author: Dmi Baranov (dmi.baranov) * Date: 2013-05-29 12:07
I'd be glad to do it, but I have some questions to discuss.

First, about the FEXTRA format: it consists of a series of subfields [1], and the current Lib/test/ :: test_read_with_extra uses a slightly incorrect extra field, at least for anyone following the format from RFC 1952. Do you have any real samples with an extra field?
Should we parse the subfields here (I have already asked Jean-Loup Gailly, maintainer of the registry of subfield IDs, for the current registry values and am waiting for a reply), or should we just provide the extra header as a byte string?
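For reference, the subfield layout in RFC 1952 is two ID bytes (SI1, SI2), a little-endian 16-bit length, then that many data bytes, repeated until the field ends. The following is a minimal sketch of such a parser; the function name parse_gzip_subfields is an assumption for illustration and does not come from the patch.

```python
import struct


def parse_gzip_subfields(extra):
    """Split a raw FEXTRA payload into (subfield_id, data) pairs.

    Hypothetical helper; raises ValueError if the payload does not
    follow the RFC 1952 subfield layout.
    """
    subfields = []
    pos = 0
    while pos < len(extra):
        if len(extra) - pos < 4:
            raise ValueError("truncated subfield header")
        subfield_id = extra[pos:pos + 2]            # SI1, SI2
        (length,) = struct.unpack_from("<H", extra, pos + 2)
        pos += 4
        data = extra[pos:pos + length]
        if len(data) != length:
            raise ValueError("truncated subfield data")
        subfields.append((subfield_id, data))
        pos += length
    return subfields


# One subfield with ID b"RA" and 8 bytes of data.
raw = b"RA" + struct.pack("<H", 8) + b"\x01\x00\xcb\xe3\x01\x00T\x0b"
print(parse_gzip_subfields(raw))
```

Returning a list of pairs rather than a dict keeps duplicate subfield IDs and their order, which a byte-string-only interface would also preserve.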

Next, about GzipFile's public interface: GzipFile(...).extra looks ugly. Should I extend this ticket to support all metadata headers? FNAME, FCOMMENT, FHCRC, etc. are read correctly now, but there is no way to access them from outside (and no way to create a file with FCOMMENT or FHCRC).

E.g., something like this:
GzipFile(...).metadata.FNAME == 'sample.gz'
GzipFile(..., extra=b'AP6Test', comment='comment')
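Until such parameters exist, a member with an extra field and a comment can be assembled by hand. The sketch below assumes the RFC 1952 field order (FEXTRA, FNAME, FCOMMENT) and Latin-1 text; build_gzip_member and its parameters are hypothetical names and do not reflect the interface under discussion.

```python
import gzip
import struct
import zlib

FEXTRA, FNAME, FCOMMENT = 0x04, 0x08, 0x10  # FLG bits from RFC 1952


def build_gzip_member(payload, *, extra=b"", name=None, comment=None, mtime=0):
    """Return one gzip member holding payload plus optional header fields.

    Hypothetical helper for illustration only.
    """
    flags = 0
    optional = b""
    if extra:
        flags |= FEXTRA
        optional += struct.pack("<H", len(extra)) + extra
    if name is not None:
        flags |= FNAME
        optional += name.encode("latin-1") + b"\x00"
    if comment is not None:
        flags |= FCOMMENT
        optional += comment.encode("latin-1") + b"\x00"
    # magic, method=deflate, flags, mtime, XFL=0, OS=255 (unknown)
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, flags, mtime, 0, 255)
    # Negative wbits selects a raw deflate stream without zlib framing.
    compressor = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    body = compressor.compress(payload) + compressor.flush()
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload) & 0xFFFFFFFF)
    return header + optional + body + trailer


blob = build_gzip_member(b"hello", extra=b"AP\x04\x00Test", comment="comment")
# The stock reader skips the extra field and comment and returns the payload.
print(gzip.decompress(blob))
```

The extra value here is a well-formed single subfield (ID b"AP", length 4), chosen so that a subfield-aware reader could also parse it.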

msg190301 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-29 12:44
I have an almost ready patch, but I have doubts about the interface, which can be discussed. ZIP file entries have a similar extra field, and I'm planning to add a similar feature to the zipfile module too.

Here are preliminary patches.
msg203077 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-11-16 19:24
Some examples:

>>> import zipfile
>>> z = zipfile.ZipFile('')
>>> z.filelist[0].extra
>>> z.filelist[0].extra_map
<zipfile.ExtraMap object at 0xb6fe8bec>
>>> list(z.filelist[0].extra_map.items())
[(21589, b'\x03\xe0\xc3\x87R'), (30837, b'\x01\x04\xe8\x03\x00\x00\x04\xe8\x03\x00\x00')]
>>> import gzip
>>> gz = gzip.open('')
>>> gz.extra_bytes
>>> gz.extra_map
<gzip.ExtraMap object at 0xb6fd04ac>
>>> list(gz.extra_map.items())
>>> gz.extra_bytes
>>> list(gz.extra_map.items())
[(b'RA', b'\x01\x00\xcb\xe3\x01\x00T\x0b')]
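The ZIP side of the examples above can be approximated with what zipfile already exposes: ZipInfo.extra holds the raw bytes, and the header-ID/size layout is simple to walk. In this sketch, parse_zip_extra is a hypothetical stand-in for the patch's extra_map, and the raw bytes are an assumption, rebuilt from the mapping shown above rather than taken from the original archive.

```python
import io
import struct
import zipfile


def parse_zip_extra(extra):
    """Split a ZIP extra field into (header_id, data) pairs.

    Hypothetical helper; each record is a little-endian uint16 header ID,
    a uint16 size, then that many data bytes.
    """
    fields = []
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        pos += 4
        data = extra[pos:pos + size]
        if len(data) != size:
            raise ValueError("truncated extra record")
        fields.append((header_id, data))
        pos += size
    return fields


# Assumed raw bytes: records 0x5455 (21589) and 0x7875 (30837).
raw = (
    struct.pack("<HH", 0x5455, 5) + b"\x03\xe0\xc3\x87R"
    + struct.pack("<HH", 0x7875, 11)
    + b"\x01\x04\xe8\x03\x00\x00\x04\xe8\x03\x00\x00"
)

# Round-trip through an in-memory archive using the existing ZipInfo.extra.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    info = zipfile.ZipInfo("example.txt")
    info.extra = raw
    zf.writestr(info, b"payload")

with zipfile.ZipFile(buf) as zf:
    stored = zf.filelist[0].extra

print(parse_zip_extra(stored))
```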
msg365626 - (view) Author: Jason Williams (Jason Williams) Date: 2020-04-02 20:51
What's needed to get this integrated? It would be great not to have to fork the gzip module.
msg391612 - (view) Author: Alex Mijalis (amijalis) Date: 2021-04-22 16:45
Agreed, it would be really nice to integrate these changes. These special fields are found in gzipped .bam files, a common DNA sequence alignment format used in the bioinformatics community. It would be nice to be able to read and write them with the standard library.
msg393052 - (view) Author: Benjamin Sergeant (Benjamin.Sergeant) Date: 2021-05-05 23:23
There is also a comment field, which would be nice to support.

Go's gzip package has a Header struct that describes all the metadata. I see that mtime was made configurable in 3.8, so hopefully comment and extra can be added too.

For our purposes, we'd like to put arbitrary data in a gzip file, but that is complicated to do. I might apply the patch here to the Python gzip module, but that feels a bit hackish.
msg393053 - (view) Author: Benjamin Sergeant (Benjamin.Sergeant) Date: 2021-05-05 23:33
type Header struct {
    Comment string    // comment
    Extra   []byte    // "extra data"
    ModTime time.Time // modification time
    Name    string    // file name
    OS      byte      // operating system type
}

This is what the header/extra things look like for reference.
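A rough Python counterpart of that struct could look like the sketch below. GzipHeader and parse_gzip_header are hypothetical names, not an existing or proposed gzip API; the only standard-library feature relied on is the mtime argument of gzip.compress, available since 3.8. Text fields are decoded as Latin-1, as RFC 1952 specifies.

```python
import gzip
import struct
from dataclasses import dataclass
from typing import Optional


@dataclass
class GzipHeader:
    """Hypothetical mirror of Go's gzip.Header."""
    comment: Optional[str]
    extra: bytes
    mtime: int
    name: Optional[str]
    os: int


def parse_gzip_header(data):
    """Parse the header of the first gzip member in data."""
    magic, method, flags, mtime, _xfl, os_code = struct.unpack_from(
        "<HBBIBB", data, 0
    )
    if magic != 0x8B1F or method != 8:
        raise ValueError("not a deflate-compressed gzip member")
    pos = 10
    extra = b""
    if flags & 0x04:  # FEXTRA: uint16 length, then the payload
        (xlen,) = struct.unpack_from("<H", data, pos)
        extra = data[pos + 2:pos + 2 + xlen]
        pos += 2 + xlen

    def read_cstring():
        # FNAME and FCOMMENT are zero-terminated Latin-1 strings.
        nonlocal pos
        end = data.index(b"\x00", pos)
        text = data[pos:end].decode("latin-1")
        pos = end + 1
        return text

    name = read_cstring() if flags & 0x08 else None
    comment = read_cstring() if flags & 0x10 else None
    return GzipHeader(comment, extra, mtime, name, os_code)


# mtime is the one header field gzip.compress lets callers set today.
header = parse_gzip_header(gzip.compress(b"data", mtime=1234))
print(header)
```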
Date User Action Args
2022-04-11 14:57:44 admin set github: 61881
2021-05-06 07:45:21 nikratio set nosy: - nikratio
2021-05-05 23:33:10 Benjamin.Sergeant set messages: + msg393053
2021-05-05 23:23:53 Benjamin.Sergeant set nosy: + Benjamin.Sergeant
    messages: + msg393052
2021-04-22 16:45:39 amijalis set nosy: + amijalis
    messages: + msg391612
2020-04-02 20:51:48 Jason Williams set nosy: + Jason Williams
    messages: + msg365626
2018-07-13 12:03:05 serhiy.storchaka set versions: + Python 3.8, - Python 3.4
2014-01-24 05:23:00 nikratio set nosy: + nikratio
2013-11-16 19:24:33 serhiy.storchaka set messages: + msg203077
    stage: needs patch -> patch review
2013-11-16 19:20:24 serhiy.storchaka set files: +
2013-11-16 19:19:58 serhiy.storchaka set files: +
2013-11-16 19:19:13 serhiy.storchaka set files: + zipfile_extra.diff
2013-11-16 19:18:40 serhiy.storchaka set files: + gzip_extra.diff
2013-11-16 19:17:55 serhiy.storchaka set files: - zip_extra.diff
2013-11-16 19:17:43 serhiy.storchaka set files: - gzip_extra.diff
2013-05-29 12:45:13 serhiy.storchaka set files: + zip_extra.diff
2013-05-29 12:44:36 serhiy.storchaka set files: + gzip_extra.diff
    keywords: + patch
    messages: + msg190301

    title: Work with an extra field of gzip files -> Work with an extra field of gzip and zip files
2013-05-29 12:07:24 dmi.baranov set nosy: + dmi.baranov
    messages: + msg190295
2013-04-09 15:03:01 serhiy.storchaka create