classification
Title: Wrapping TextIOWrapper around gzip files
Type: enhancement Stage: resolved
Components: Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: alex, dabeaz, eric.araujo, gruszczy, haypo, jcea, nadeem.vawda, pitrou, python-dev, r.david.murray
Priority: normal Keywords: patch

Created on 2010-12-29 19:53 by dabeaz, last changed 2011-04-07 13:20 by jcea. This issue is now closed.

Files
File name Uploaded Description Edit
gzipfile_read1.diff nadeem.vawda, 2011-04-04 10:56 Add GzipFile.read1() plus tests review
gzipfile_read1.diff nadeem.vawda, 2011-04-04 17:01 Fix bug in previous patch review
Messages (30)
msg124870 - (view) Author: David Beazley (dabeaz) Date: 2010-12-29 19:53
Is something like this supposed to work:

>>> import gzip
>>> import io
>>> f = io.TextIOWrapper(gzip.open("foo.gz"),encoding='ascii'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: readable

In a nutshell--reading a .gz file as text.
msg124875 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-29 21:59
Since GZipFile inherits from BufferedIOBase, and TextIOWrapper is supposed to be designed to wrap a BufferedIOBase object, I would say yes it ought to work.  On the other hand there may also be a doc error there, since it may be that TextIOWrapper actually needs to wrap one of the subclasses of BufferedIOBase.
msg124876 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-29 22:00
Oops.  It only has that inheritance in 3.2.
msg124877 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-29 22:01
Heh, and 2.7.  Fixing versions yet again.
msg124878 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-12-29 22:12
This should be easy to fix, if only the "readable" and "writable" methods are needed. Do you want to try writing a patch?
msg124879 - (view) Author: David Beazley (dabeaz) Date: 2010-12-29 22:49
It goes without saying that this also needs to be checked with the bz2 module. A quick check seems to indicate that it has the same problem.

While you're at it, maybe someone could add an 'open' function to bz2 to make it symmetrical with gzip as well :-).
msg124880 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-29 22:55
bz2 is a pure C module, so that's a very different situation.
msg124881 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-12-29 22:56
> While you're at it, maybe someone could add an 'open' function to bz2
> to make it symmetrical with gzip as well :-).

That's a nice idea, but quite orthogonal to this issue.
msg124882 - (view) Author: David Beazley (dabeaz) Date: 2010-12-29 22:58
C or not, wrapping a BZ2File instance with a TextIOWrapper to get text still seems like something that someone might want to do.  I doubt it would take much modification to give BZ2File instances the required set of methods.
msg124884 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-29 23:05
Right, but in the bz2 case I think it is a feature request rather than a bugfix.  In any case it should be a separate issue.
msg124885 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-12-29 23:10
> C or not, wrapping a BZ2File instance with a TextIOWrapper to get text
> still seems like something that someone might want to do.  I doubt it
> would take much modification to give BZ2File instances the required
> set of methods.

BZ2File uses FILE pointers internally so it may be more complicated than
it looks to be (because the methods may not have the right semantics).
msg124886 - (view) Author: David Beazley (dabeaz) Date: 2010-12-29 23:12
Do Python devs really view gzip and bz2 as two totally completely different animals?  They both have the same functionality and would be used for the same kinds of things.   Maybe I'm missing something.
msg124887 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-12-29 23:14
> Do Python devs really view gzip and bz2 as two totally completely
> different animals?  They both have the same functionality and would be
> used for the same kinds of things.   Maybe I'm missing something.

Well, the reality of divergent implementation strategies trumps the
theory of API compatibility :) The approach taken by bz2 is IMO
regrettable, but it's not a ten minutes job to write it again from
scratch.
msg124888 - (view) Author: David Beazley (dabeaz) Date: 2010-12-29 23:17
Hmmm. Interesting.  In the big picture, it might be an interesting project for someone (not necessarily the core devs) to sit down and refactor both of these modules so that they play nice with Python 3 I/O system.  Obviously that's a project outside the scope of this bug or the 3.2 release for that matter.
msg129187 - (view) Author: David Beazley (dabeaz) Date: 2011-02-23 13:14
Bump.  This is still broken in Python 3.2.
msg129198 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-02-23 14:29
If a patch had been proposed it probably would have gotten in to 3.2.  Maybe someone (perhaps you?) will find the time before 3.2.1.

Someone has decided to work on the bz2 rewrite, by the way (issue 5863).
msg129199 - (view) Author: David Beazley (dabeaz) Date: 2011-02-23 14:37
If I can find some time, I may took a look at this.   I just noticed that similar problems arise trying to wrap TextIOWrapper around the file-like objects returned by urllib.request.urlopen as well.

In the big picture, some discussion of what it means to be "file-like" might be in order.   If something is "file-like" and binary, should that always imply that I be able to wrap a TextIOWrapper object around it in order to encode/decode text?  I would argue "yes", but I'd be curious to know what others think.
msg129200 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-02-23 14:42
What is the problem with Python 3.2? It works correctly here:

$ cat bla.txt
bli
blo
bla
$ gzip bla.txt 
$ ./python 
Python 3.3a0 (unknown, Feb 23 2011, 13:03:50) 
>>> import gzip, io
>>> f = io.TextIOWrapper(gzip.open("bla.txt.gz"),encoding='ascii')
>>> f.read()
'bli\nblo\nbla\n'

If someone added Python 3.2 in the Versions field because of an issue with bz2: please open a new issue instead.
msg129201 - (view) Author: David Beazley (dabeaz) Date: 2011-02-23 14:46
Python 3.2 (r32:88445, Feb 20 2011, 21:51:21) 
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip
>>> import io
>>> f = io.TextIOWrapper(gzip.open("file.gz"),encoding='latin-1')
>>> f.readline()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
io.UnsupportedOperation: read1
>>>
msg129203 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-02-23 14:56
Yes, a clear definition of the minimum requirements for being wrapped by TextIOWrapper sounds like a necessary thing to have (and I'd be inclined to agree with your assertion, but I didn't work on the IO library :).  It would be best to open a new issue for that.
msg129205 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-02-23 15:04
> Yes, a clear definition of the minimum requirements for being wrapped
> by TextIOWrapper sounds like a necessary thing to have

About that: is read1() argument mandatory or not?

In _pyio, BufferedIOBase.read1() argument is optional (default: None); BytesIO.read1(), BufferedReader.read1(), BufferedRWPair.read1(), BufferedRandom.read1() argument is mandatory.

In _io, BufferedIOBase.read1() raises directly an exception, without checking the arguments; BufferedReader.read1() argument is mandatory.

In the io doc, BufferedIOBase.read1() argument is optional (default: -1), BytesIO.read1() has no argument (!) and BufferedReader.read1() argument is mandatory.
msg129215 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-02-23 16:52
It would probably be ok to fallback on read() when read1() isn't implemented. read1() is supposed to be implemented by all BufferedIO-compliant classes, but in all honesty I don't think it's very useful in practice. It's supposed to be an optimization, and I think it's a misguided one; the generalized prefetch() primitive I proposed last year would certainly be more useful: see http://mail.python.org/pipermail/python-dev/2010-September/104194.html
msg132757 - (view) Author: Filip GruszczyƄski (gruszczy) Date: 2011-04-01 16:52
Is following change in GzipFile class enough:

    def read1(self, n):
        return self.read(n)

? This satisfies TextIOWrapper to run readline correctly.
msg132899 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2011-04-03 22:28
> Is following change in GzipFile class enough:
> 
>     def read1(self, n):
>         return self.read(n)
> 
> ? This satisfies TextIOWrapper to run readline correctly.

Looks good to me.

By the way, BZ2File now works correctly - the fix for issue5863 adds read1().
msg132902 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-04-03 22:41
> Nadeem Vawda <nadeem.vawda@gmail.com> added the comment:
> 
> > Is following change in GzipFile class enough:
> > 
> >     def read1(self, n):
> >         return self.read(n)
> > 
> > ? This satisfies TextIOWrapper to run readline correctly.
> 
> Looks good to me.

Well, ideally, read1() should satisfy the condition stated in the
BufferedIOBase documentation - namely, that it issues at most one read()
call on the underlying stream.
msg132931 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2011-04-04 10:56
Here's an implementation of read1() that satisfies that condition, along with
some relevant unit tests.
msg132948 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-04-04 14:18
> Here's an implementation of read1() that satisfies that condition, along with
> some relevant unit tests.

Something looks fishy: what happens if size is -1 and EOFError is not
raised?
msg132953 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2011-04-04 17:01
> Something looks fishy: what happens if size is -1 and EOFError is not raised?

You're right - I missed that possibility. In that case, extrasize and offset get
updated incorrectly, which will break subsequent calls to seek() and tell().
However, it seems that subsequent reads work fine, because slicing a bytes object
with a too-large upper bound doesn't raise an exception.

The attached patch fixes this bug, and updates test_read1() to catch regressions.
msg132959 - (view) Author: Roundup Robot (python-dev) Date: 2011-04-04 19:01
New changeset 9775d67c9af9 by Antoine Pitrou in branch 'default':
Issue #10791: Implement missing method GzipFile.read1(), allowing GzipFile
http://hg.python.org/cpython/rev/9775d67c9af9
msg132960 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-04-04 19:02
Patch now committed, thank you!
Since the patch adds a new API (GzipFile.read1()), I think it's better not to backport it.
History
Date User Action Args
2011-04-07 13:20:42jceasetnosy: + jcea
2011-04-04 19:02:53pitrousetstatus: open -> closed
versions: - Python 2.7, Python 3.2
type: behavior -> enhancement
messages: + msg132960

resolution: fixed
stage: needs patch -> resolved
2011-04-04 19:01:58python-devsetnosy: + python-dev
messages: + msg132959
2011-04-04 17:01:25nadeem.vawdasetfiles: + gzipfile_read1.diff

messages: + msg132953
2011-04-04 14:18:21pitrousetmessages: + msg132948
2011-04-04 10:56:01nadeem.vawdasetfiles: + gzipfile_read1.diff
keywords: + patch
messages: + msg132931
2011-04-03 22:41:02pitrousetmessages: + msg132902
2011-04-03 22:28:48nadeem.vawdasetmessages: + msg132899
2011-04-02 10:27:46nadeem.vawdasetnosy: + nadeem.vawda
2011-04-01 16:52:06gruszczysetnosy: + gruszczy
messages: + msg132757
2011-02-23 18:26:10alexsetnosy: + alex
2011-02-23 16:52:33pitrousetnosy: pitrou, haypo, eric.araujo, r.david.murray, dabeaz
messages: + msg129215
2011-02-23 15:04:04hayposetnosy: pitrou, haypo, eric.araujo, r.david.murray, dabeaz
messages: + msg129205
2011-02-23 14:56:00r.david.murraysetnosy: pitrou, haypo, eric.araujo, r.david.murray, dabeaz
messages: + msg129203
versions: + Python 3.2, Python 3.3
2011-02-23 14:46:41dabeazsetnosy: pitrou, haypo, eric.araujo, r.david.murray, dabeaz
messages: + msg129201
2011-02-23 14:42:58hayposetnosy: + haypo

messages: + msg129200
versions: - Python 3.2
2011-02-23 14:37:08dabeazsetnosy: pitrou, eric.araujo, r.david.murray, dabeaz
messages: + msg129199
2011-02-23 14:29:18r.david.murraysetnosy: pitrou, eric.araujo, r.david.murray, dabeaz
messages: + msg129198
2011-02-23 13:14:41dabeazsetnosy: pitrou, eric.araujo, r.david.murray, dabeaz
messages: + msg129187
2011-01-02 18:34:37eric.araujosetnosy: + eric.araujo
2010-12-29 23:17:37dabeazsetnosy: pitrou, r.david.murray, dabeaz
messages: + msg124888
2010-12-29 23:14:21pitrousetnosy: pitrou, r.david.murray, dabeaz
messages: + msg124887
2010-12-29 23:12:11dabeazsetnosy: pitrou, r.david.murray, dabeaz
messages: + msg124886
2010-12-29 23:10:48pitrousetnosy: pitrou, r.david.murray, dabeaz
messages: + msg124885
2010-12-29 23:05:18r.david.murraysetnosy: pitrou, r.david.murray, dabeaz
messages: + msg124884
2010-12-29 22:58:09dabeazsetnosy: pitrou, r.david.murray, dabeaz
messages: + msg124882
2010-12-29 22:56:33pitrousetnosy: pitrou, r.david.murray, dabeaz
messages: + msg124881
2010-12-29 22:55:00r.david.murraysetnosy: pitrou, r.david.murray, dabeaz
messages: + msg124880
2010-12-29 22:49:32dabeazsetnosy: pitrou, r.david.murray, dabeaz
messages: + msg124879
2010-12-29 22:12:09pitrousetnosy: pitrou, r.david.murray, dabeaz
messages: + msg124878
2010-12-29 22:01:39r.david.murraysetnosy: pitrou, r.david.murray, dabeaz
messages: + msg124877
versions: + Python 2.7
2010-12-29 22:00:18r.david.murraysetnosy: pitrou, r.david.murray, dabeaz
messages: + msg124876
versions: - Python 3.1, Python 2.7
2010-12-29 21:59:11r.david.murraysetversions: + Python 3.1, Python 2.7
nosy: + r.david.murray, pitrou

messages: + msg124875

stage: needs patch
2010-12-29 19:53:51dabeazcreate