classification
Title: Extra gzip headers breaks _read_gzip_header
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.4, Python 3.3, Python 3.2, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: Arfrever, benjamin.peterson, christian.heimes, georg.brandl, larry, maubp, pitrou, python-dev, serhiy.storchaka
Priority: release blocker Keywords:

Created on 2013-04-08 17:55 by maubp, last changed 2013-05-12 10:32 by python-dev. This issue is now closed.

Messages (9)
msg186320 - (view) Author: Peter (maubp) Date: 2013-04-08 17:55
Regression in Python 3.3.0 to 3.3.1, tested under Mac OS X 10.8 and CentOS Linux 64bit.

The same regression also present in going from Python 2.7.3 from 2.7.4, does that need a separate issue filed?

Consider this VALID GZIP file, human link:
https://github.com/biopython/biopython/blob/master/Tests/GenBank/cor6_6.gb.bgz

Binary link, only a small file:
https://raw.github.com/biopython/biopython/master/Tests/GenBank/cor6_6.gb.bgz

This is compressed using a GZIP variant called BGZF which uses multiple blocks and records additional tags in the header, for background see:
http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html

$ curl -O https://raw.github.com/biopython/biopython/master/Tests/GenBank/cor6_6.gb.bgz
$ cat cor6_6.gb.bgz | gunzip | wc
     320    1183   14967

Now for the bug, expected behaviour:

$ python3.2
Python 3.2 (r32:88445, Feb 28 2011, 17:04:33) 
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip
>>> handle = gzip.open("cor6_6.gb.bgz", "rb")
>>> data = handle.read()
>>> handle.close()
>>> len(data)
14967
>>> quit()

Broken behaviour:

$ python3.3
Python 3.3.1 (default, Apr  8 2013, 17:54:08) 
[GCC 4.2.1 Compatible Apple Clang 4.0 ((tags/Apple/clang-421.0.57))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip
>>> handle = gzip.open("cor6_6.gb.bgz", "rb")
>>> data = handle.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/pjcock/lib/python3.3/gzip.py", line 359, in read
    while self._read(readsize):
  File "/Users/pjcock/lib/python3.3/gzip.py", line 432, in _read
    if not self._read_gzip_header():
  File "/Users/pjcock/lib/python3.3/gzip.py", line 305, in _read_gzip_header
    self._read_exact(struct.unpack("<H", self._read_exact(2)))
  File "/Users/pjcock/lib/python3.3/gzip.py", line 282, in _read_exact
    data = self.fileobj.read(n)
  File "/Users/pjcock/lib/python3.3/gzip.py", line 81, in read
    return self.file.read(size)
TypeError: integer argument expected, got 'tuple'


The bug is very simple, an error in line 205 of gzip.py:

203     if flag & FEXTRA:
204         # Read & discard the extra field, if present
205         self._read_exact(struct.unpack("<H", self._read_exact(2)))

The struct.unpack method returns a single element tuple, thus a fix is:

203     if flag & FEXTRA:
204         # Read & discard the extra field, if present
205         extra_len, = struct.unpack("<H", self._read_exact(2))
206         self._read_exact(extra_len)

This bug was identified via failing Biopython unit tests under Python 2.7.4 and 3.3.1, which all pass with this minor fix applied.
msg186336 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-04-08 19:27
Oh-oh-h, it's my fault. Thank you, Peter, for your report and the proposed fix.
msg186339 - (view) Author: Roundup Robot (python-dev) Date: 2013-04-08 19:39
New changeset 29f0836c0456 by Serhiy Storchaka in branch '2.7':
Close #17666: Fix reading gzip files with an extra field.
http://hg.python.org/cpython/rev/29f0836c0456

New changeset f78d2605f452 by Serhiy Storchaka in branch '3.3':
Close #17666: Fix reading gzip files with an extra field.
http://hg.python.org/cpython/rev/f78d2605f452

New changeset cd5e3adc6fc1 by Serhiy Storchaka in branch 'default':
Close #17666: Fix reading gzip files with an extra field.
http://hg.python.org/cpython/rev/cd5e3adc6fc1
msg186390 - (view) Author: Peter (maubp) Date: 2013-04-09 09:38
Reopening: The same regression issue affects Python 3.2.4 as well, so if the fix could be committed to that branch as well that would be great.

Long term, I infer that there are no GZIP files in the test suite which use the GZIP header to store metadata (otherwise this bug would have been caught earlier). Should that be raised as a separate issue? I can prepare a very small test file if you like and attach it here.
msg186394 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-04-09 11:41
Adding Georg for 3.2.
msg186416 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-04-09 14:44
The committed patch contains a simple test for gzip file with an extra field. When we add the feature to create gzip files with an extra field, this test can be extended.
msg186492 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2013-04-10 13:16
Serhiy, do you want to fix this also in 3.2? I'll roll a brown-paper-bag release of 3.2 and 3.3 together with 2.7.5.
msg188990 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2013-05-12 09:00
Cherry-picked to 3.2 branch as c31ff361cde3.
msg189004 - (view) Author: Roundup Robot (python-dev) Date: 2013-05-12 10:32
New changeset c31ff361cde3 by Serhiy Storchaka in branch '3.2':
Close #17666: Fix reading gzip files with an extra field.
http://hg.python.org/cpython/rev/c31ff361cde3
History
Date User Action Args
2013-05-12 10:32:43python-devsetmessages: + msg189004
stage: commit review -> resolved
2013-05-12 09:00:19georg.brandlsetstatus: open -> closed
resolution: fixed
messages: + msg188990
2013-04-10 18:04:53Arfreversetnosy: + Arfrever
2013-04-10 16:44:44gregory.p.smithsetpriority: normal -> release blocker
nosy: + benjamin.peterson, larry
2013-04-10 13:16:53georg.brandlsetmessages: + msg186492
2013-04-09 14:44:47serhiy.storchakasetmessages: + msg186416
2013-04-09 11:52:39christian.heimessetnosy: + christian.heimes

resolution: fixed -> (no value)
stage: resolved -> commit review
2013-04-09 11:41:40pitrousetnosy: + georg.brandl, pitrou
messages: + msg186394
2013-04-09 09:38:15maubpsetstatus: closed -> open

messages: + msg186390
versions: + Python 3.2
2013-04-08 19:39:38python-devsetstatus: open -> closed

nosy: + python-dev
messages: + msg186339

resolution: fixed
stage: needs patch -> resolved
2013-04-08 19:28:00serhiy.storchakasetmessages: + msg186336
2013-04-08 18:12:37pitrousetversions: + Python 2.7, Python 3.4
nosy: + serhiy.storchaka

assignee: serhiy.storchaka
type: behavior
stage: needs patch
2013-04-08 17:55:12maubpcreate