classification
Title: GzipFile's .seekable() returns True even if underlying buffer is not seekable
Type: Stage: patch review
Components: Library (Lib) Versions: Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Walt Askew, cheryl.sabella, martin.panter, pitrou, serhiy.storchaka, xtreak
Priority: normal Keywords: patch

Created on 2018-03-28 16:28 by Walt Askew, last changed 2018-11-09 08:55 by serhiy.storchaka.

Pull Requests
URL Status Linked Edit
PR 6303 open python-dev, 2018-03-28 23:46
Messages (5)
msg314613 - (view) Author: Walt Askew (Walt Askew) Date: 2018-03-28 16:28
The seekable method on gzip.GzipFile always returns True, even if the underlying buffer is not seekable. However, if seek is called on the GzipFile, the seek will fail unless the underlying buffer is seekable. This can cause consumers of the GzipFile object to mistakenly believe calling seek on the object is safe, when in fact it will lead to an exception.

For example, this led to a bug when I was trying to use requests & boto3 to stream & decompress an S3 upload like so:

resp = requests.get(uri, stream=True)
decompressed = gzip.GzipFile(fileobj=resp.raw)
boto3.client('s3').upload_fileobj(decompressed, Bucket=bucket, Key=key)

boto3 checks the seekable method on the the GzipFile, chooses a code path based on the file being seekable but later raises an exception when the seek call fails because the underlying HTTP stream is not seekable.
msg327206 - (view) Author: Cheryl Sabella (cheryl.sabella) * (Python triager) Date: 2018-10-06 01:04
Allowing for non seekable files was added in issue1675951.  And under that issue in msg117131, the author of the change wrote:
"The patch creates another problem with is not yet fixed: The implementation of .seekable() is becoming wrong. As one can now use non seekable files the implementation should check if the file object used for reading is really seekable."

issue23529 made significant changes to the code and seekable() is again mentioned in msg239245 and subsequent comments.

Nosying the devs who worked on those issues.
msg327226 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2018-10-06 06:03
If a change is made, it would be nice to bring the “gzip”, “bzip” and LZMA modules closer together. The current “bzip” and LZMA modules rely on the underlying “seekable” method without a fallback implementation, but also have a check for read mode.

I think the seeking functionality in these modules is a misfeature. But since it is already here, it is probably best to leave it alone, and just document it.

My comment about making “seekable” stricter is at <https://bugs.python.org/review/23529/diff/14296/Lib/gzip.py#oldcode550>. Even if the underlying stream is not seekable, GzipFile can still fast-forward. Here is a demonstration:

>>> z = BytesIO(bytes.fromhex(
...     "1F8B08000000000002FFF348CD29D051F05448CC55282E294DCE56C8CC53485448AFCA"
...     "2C5048CBCC490500F44BF0A01F000000"
... ))
>>> def seek(*args): raise UnsupportedOperation()
... 
>>> z.seek = seek  # Make the underlying stream not seekable
>>> f = GzipFile(fileobj=z)
>>> f.read(10)
b'Help, I am'
>>> f.seek(20)  # Fast forward
20
>>> f.read()
b'a gzip file'
>>> f.seek(0)  # Rewind
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/proj/python/cpython/Lib/gzip.py", line 368, in seek
    return self._buffer.seek(offset, whence)
  File "/home/proj/python/cpython/Lib/_compression.py", line 137, in seek
    self._rewind()
  File "/home/proj/python/cpython/Lib/gzip.py", line 515, in _rewind
    super()._rewind()
  File "/home/proj/python/cpython/Lib/_compression.py", line 115, in _rewind
    self._fp.seek(0)
  File "/home/proj/python/cpython/Lib/gzip.py", line 105, in seek
    return self.file.seek(off)
  File "<stdin>", line 1, in seek
io.UnsupportedOperation
msg329505 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-11-09 08:45
I share Martin's opinion that this is a misfeature. User code can check seekable() and use seek() if it returns True or cache necessary data in memory if it returns False, because it is expected that seek() is more efficient. But in case of GzipFile it is not efficient, and can lead to decompression the whole content of the file and to much worse performance.
msg329506 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-11-09 08:55
And I share Martin's concern about fast-forward with an unseekable underlying file. If this works in current code, we can't simply return break it. This may mean that we can't change the implementation of GzipFile.seekable() at all, even if it lies in some cases.
History
Date User Action Args
2018-11-09 08:55:32serhiy.storchakasetmessages: + msg329506
2018-11-09 08:45:28serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg329505
2018-10-06 06:03:22martin.pantersetmessages: + msg327226
2018-10-06 01:04:28cheryl.sabellasetnosy: + cheryl.sabella, pitrou, martin.panter

messages: + msg327206
versions: + Python 3.8, - Python 3.6
2018-09-14 08:39:14xtreaksetnosy: + xtreak
2018-03-28 23:46:24python-devsetkeywords: + patch
stage: patch review
pull_requests: + pull_request6021
2018-03-28 16:28:47Walt Askewcreate