This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: gzip, bz2, lzma: peek advances file position of existing file object
Type: enhancement Stage: resolved
Components: Documentation Versions: Python 3.3, Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: docs@python, madison.may, nadeem.vawda, nbargnesi, python-dev, r.david.murray
Priority: normal Keywords: patch

Created on 2013-07-11 19:34 by nbargnesi, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
issue18430.patch nbargnesi, 2013-07-11 21:59 review
Messages (15)
msg192890 - (view) Author: Nick Bargnesi (nbargnesi) Date: 2013-07-11 19:34
Using existing file objects as arguments to the open functions in the gzip, bz2, and lzma libraries can cause the underlying fileobj position to get changed - and not quite in ways one would expect.

Calling peek against the returned file objects -- gzip.GzipFile, bz2.BZ2File, and lzma.LZMAFile will in one scenario advance the position of the supplied file object:

    >>> import bz2
    >>> fileobj = open('test.bz2', mode='rb')
    >>> bzfile = bz2.open(fileobj)
    >>>
    >>> # file positions at 0
    >>> assert fileobj.tell() == 0, bzfile.tell() == 0
    >>>
    >>> bzfile.peek()
    b'Test file.\n'
    >>> fileobj.tell()
    52

If after the initial peek, we rewind the underlying fileobj and peek again, the behavior changes:

    >>> fileobj.seek(0)
    0
    >>> bzfile.peek()
    b'Test file.\n'
    >>> fileobj.tell()
    0

The second scenario serves to complicate things a bit with the change in behavior.

I would be less surprised if the module documentation simply stated the affect on file position of the file object being used. Though it would be beautiful if the underlying file object didn't change at all. The latter seems possible since the three modules know whether the file object is seekable.

The gzip and lzma modules exhibit similar behavior - gzip for example:

    >>> import gzip
    >>> fileobj = open('test.gz', mode='rb')
    >>> gzfile = gzip.open(fileobj)
    >>>
    >>> # file positions at 0
    >>> assert fileobj.tell() == 0 and gzfile.tell() == 0
    >>>
    >>> gzfile.peek(1)
    b'Test file.\n'
    >>> fileobj.tell()
    36
    >>>
    >>> # rewind, and do it again
    >>> fileobj.seek(0)
    >>>
    >>> gzfile.peek(1)
    b'Test file.\n'
    >>> fileobj.tell()
    0
msg192893 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-07-11 19:54
I guess this issue can't be fixed.
msg192896 - (view) Author: Madison May (madison.may) * Date: 2013-07-11 20:09
Why would something like the following work?

#At the beginning of peek()
#keep track of prior offset
position = self.fileobj.tell()

#immediately before return statement
#restore previous fileobj offset
self.fileobj.seek(position)
msg192898 - (view) Author: Madison May (madison.may) * Date: 2013-07-11 20:10
*wouldn't
msg192900 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-07-11 20:26
Regardless of whether or not it *can* be fixed, I personally would not expect the file position to be either unchanged or predictable.  The file object is being passed to something that is going to read and/or write from it, after all.  If the calling application needs the file to be in a particular seek position, it should save it and reset it, I think.
msg192904 - (view) Author: Nick Bargnesi (nbargnesi) Date: 2013-07-11 20:57
In that the underlying file position is not deterministic when used like this, I'm inclined to agree.

Though faced with documentation stating it does not "advance the file position", it certainly is less than explicit about it.
msg192906 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-07-11 21:03
Hmm, yes.  Perhaps it should say "does not advance the <gzip> read position"?
msg192910 - (view) Author: Nick Bargnesi (nbargnesi) Date: 2013-07-11 21:20
That's an improvement.

Using wording similar to the constructor's:

Calling a GzipFile object’s peek() method does not advance its position, but the fileobj may be affected. The caller may wish to save the fileobj position prior to calling peek() and resetting it upon return.
msg192911 - (view) Author: Madison May (madison.may) * Date: 2013-07-11 21:31
Sounds good to me.
msg192914 - (view) Author: Nick Bargnesi (nbargnesi) Date: 2013-07-11 21:59
Proposed documentation change to gzip, bz2, and lzma modules, in patch form.
msg192972 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-07-12 21:36
I don't think the comment about save and restore of the file position adds enough value to be worth the possible confusion ("do I only need to save and restore around peek?  Why?")  I think the sentence should read, ex, "Calling a :class:`LZMAFile` object's :meth:`peek` method does not advance its access position, but the access position of the underlying :term:`file object` may be affected."

Note that I have no objection to making the two sync up, if it is in fact possible.  But if Serhiy says it can't be, I expect he's right.
msg192979 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-07-12 22:25
A file can be non-seekable (pipe, socket) and we can't "unread" unused data back.

We can't implement GzipFile.peek() in terms of underlied file's peek() because peek() doesn't guaranteed return more than one byte, but it should return at least one byte if the file is not ended. We can't detect the end of compressed data without reading some data past the end of compressed data.
msg192988 - (view) Author: Nick Bargnesi (nbargnesi) Date: 2013-07-13 02:15
I'd suggest sticking with "file position" instead of switching to "access position". E.g., the complete lzma wording would be:

    Return buffered data without advancing the file position. At least
    one byte of data will be returned, unless EOF has been reached. The exact
    number of bytes returned is unspecified (the *size* argument is ignored).

    Although calling a :class:`LZMAFile` object's :meth:`peek` method does not
    advance its file position, the file position of the underlying
    :term:`file object` may be affected.

The point about mentioning "save and restore" notwithstanding, *any* documentation about the effect on position change is a step in the right direction. If the file position not changing is the better scenario, having the side effect documented is at least good. We can save people's time by being explicit about the side effect _and_ maintain code readability.

Imagine hunting for a bug manifesting itself as a change in file position only to find out a peek() call was the cause.
msg192989 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-07-13 02:42
I chose 'access position' because that's the terminology used in the 'seek' man page.  'file position' sounds odd to me, but since that is the term already used it does make sense to be consistent.
msg205591 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-12-08 18:50
New changeset 5ca6e8af0aab by Nadeem Vawda in branch '3.3':
#18430: Document that peek() may change the position of the underlying file for
http://hg.python.org/cpython/rev/5ca6e8af0aab

New changeset 0f587fe304be by Nadeem Vawda in branch 'default':
Closes #18430: Document that peek() may change the position of the underlying
http://hg.python.org/cpython/rev/0f587fe304be
History
Date User Action Args
2022-04-11 14:57:47adminsetgithub: 62630
2013-12-08 18:50:27python-devsetstatus: open -> closed

nosy: + python-dev
messages: + msg205591

resolution: fixed
stage: needs patch -> resolved
2013-08-06 11:51:34serhiy.storchakasetnosy: - serhiy.storchaka
2013-07-13 02:42:24r.david.murraysetassignee: docs@python
components: + Documentation, - Library (Lib)
versions: + Python 3.4
nosy: + docs@python

messages: + msg192989
stage: needs patch
2013-07-13 02:15:33nbargnesisetmessages: + msg192988
2013-07-12 22:25:04serhiy.storchakasetmessages: + msg192979
2013-07-12 21:36:15r.david.murraysetmessages: + msg192972
2013-07-11 21:59:49nbargnesisetfiles: + issue18430.patch
keywords: + patch
messages: + msg192914
2013-07-11 21:31:43madison.maysetmessages: + msg192911
2013-07-11 21:20:13nbargnesisetmessages: + msg192910
2013-07-11 21:03:55r.david.murraysetmessages: + msg192906
2013-07-11 20:57:43nbargnesisetmessages: + msg192904
2013-07-11 20:26:36r.david.murraysetnosy: + r.david.murray
messages: + msg192900
2013-07-11 20:10:33madison.maysetmessages: + msg192898
2013-07-11 20:09:52madison.maysetnosy: + madison.may
messages: + msg192896
2013-07-11 19:54:11serhiy.storchakasetnosy: + nadeem.vawda, serhiy.storchaka
messages: + msg192893
2013-07-11 19:34:01nbargnesicreate