Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrapping TextIOWrapper around gzip files #55000

Closed
dabeaz mannequin opened this issue Dec 29, 2010 · 30 comments
Closed

Wrapping TextIOWrapper around gzip files #55000

dabeaz mannequin opened this issue Dec 29, 2010 · 30 comments
Labels
type-feature A feature request or enhancement

Comments

@dabeaz
Copy link
Mannequin

dabeaz mannequin commented Dec 29, 2010

BPO 10791
Nosy @jcea, @pitrou, @vstinner, @merwok, @alex, @bitdancer
Files
  • gzipfile_read1.diff: Add GzipFile.read1() plus tests
  • gzipfile_read1.diff: Fix bug in previous patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2011-04-04.19:02:53.108>
    created_at = <Date 2010-12-29.19:53:51.701>
    labels = ['type-feature']
    title = 'Wrapping TextIOWrapper around gzip files'
    updated_at = <Date 2011-04-07.13:20:42.278>
    user = 'https://bugs.python.org/dabeaz'

    bugs.python.org fields:

    activity = <Date 2011-04-07.13:20:42.278>
    actor = 'jcea'
    assignee = 'none'
    closed = True
    closed_date = <Date 2011-04-04.19:02:53.108>
    closer = 'pitrou'
    components = []
    creation = <Date 2010-12-29.19:53:51.701>
    creator = 'dabeaz'
    dependencies = []
    files = ['21531', '21532']
    hgrepos = []
    issue_num = 10791
    keywords = ['patch']
    message_count = 30.0
    messages = ['124870', '124875', '124876', '124877', '124878', '124879', '124880', '124881', '124882', '124884', '124885', '124886', '124887', '124888', '129187', '129198', '129199', '129200', '129201', '129203', '129205', '129215', '132757', '132899', '132902', '132931', '132948', '132953', '132959', '132960']
    nosy_count = 10.0
    nosy_names = ['jcea', 'pitrou', 'vstinner', 'nadeem.vawda', 'eric.araujo', 'alex', 'r.david.murray', 'gruszczy', 'dabeaz', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue10791'
    versions = ['Python 3.3']

    @dabeaz
    Copy link
    Mannequin Author

    dabeaz mannequin commented Dec 29, 2010

    Is something like this supposed to work:

    >>> import gzip
    >>> import io
    >>> f = io.TextIOWrapper(gzip.open("foo.gz"),encoding='ascii'))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: readable

    In a nutshell--reading a .gz file as text.

    @dabeaz dabeaz mannequin added the type-bug An unexpected behavior, bug, or error label Dec 29, 2010
    @bitdancer
    Copy link
    Member

    Since GZipFile inherits from BufferedIOBase, and TextIOWrapper is supposed to be designed to wrap a BufferedIOBase object, I would say yes it ought to work. On the other hand there may also be a doc error there, since it may be that TextIOWrapper actually needs to wrap one of the subclasses of BufferedIOBase.

    @bitdancer
    Copy link
    Member

    Oops. It only has that inheritance in 3.2.

    @bitdancer
    Copy link
    Member

    Heh, and 2.7. Fixing versions yet again.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 29, 2010

    This should be easy to fix, if only the "readable" and "writable" methods are needed. Do you want to try writing a patch?

    @dabeaz
    Copy link
    Mannequin Author

    dabeaz mannequin commented Dec 29, 2010

    It goes without saying that this also needs to be checked with the bz2 module. A quick check seems to indicate that it has the same problem.

    While you're at it, maybe someone could add an 'open' function to bz2 to make it symmetrical with gzip as well :-).

    @bitdancer
    Copy link
    Member

    bz2 is a pure C module, so that's a very different situation.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 29, 2010

    While you're at it, maybe someone could add an 'open' function to bz2
    to make it symmetrical with gzip as well :-).

    That's a nice idea, but quite orthogonal to this issue.

    @dabeaz
    Copy link
    Mannequin Author

    dabeaz mannequin commented Dec 29, 2010

    C or not, wrapping a BZ2File instance with a TextIOWrapper to get text still seems like something that someone might want to do. I doubt it would take much modification to give BZ2File instances the required set of methods.

    @bitdancer
    Copy link
    Member

    Right, but in the bz2 case I think it is a feature request rather than a bugfix. In any case it should be a separate issue.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 29, 2010

    C or not, wrapping a BZ2File instance with a TextIOWrapper to get text
    still seems like something that someone might want to do. I doubt it
    would take much modification to give BZ2File instances the required
    set of methods.

    BZ2File uses FILE pointers internally so it may be more complicated than
    it looks to be (because the methods may not have the right semantics).

    @dabeaz
    Copy link
    Mannequin Author

    dabeaz mannequin commented Dec 29, 2010

    Do Python devs really view gzip and bz2 as two totally completely different animals? They both have the same functionality and would be used for the same kinds of things. Maybe I'm missing something.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 29, 2010

    Do Python devs really view gzip and bz2 as two totally completely
    different animals? They both have the same functionality and would be
    used for the same kinds of things. Maybe I'm missing something.

    Well, the reality of divergent implementation strategies trumps the
    theory of API compatibility :) The approach taken by bz2 is IMO
    regrettable, but it's not a ten minutes job to write it again from
    scratch.

    @dabeaz
    Copy link
    Mannequin Author

    dabeaz mannequin commented Dec 29, 2010

    Hmmm. Interesting. In the big picture, it might be an interesting project for someone (not necessarily the core devs) to sit down and refactor both of these modules so that they play nice with Python 3 I/O system. Obviously that's a project outside the scope of this bug or the 3.2 release for that matter.

    @dabeaz
    Copy link
    Mannequin Author

    dabeaz mannequin commented Feb 23, 2011

    Bump. This is still broken in Python 3.2.

    @bitdancer
    Copy link
    Member

    If a patch had been proposed it probably would have gotten in to 3.2. Maybe someone (perhaps you?) will find the time before 3.2.1.

    Someone has decided to work on the bz2 rewrite, by the way (bpo-5863).

    @dabeaz
    Copy link
    Mannequin Author

    dabeaz mannequin commented Feb 23, 2011

    If I can find some time, I may took a look at this. I just noticed that similar problems arise trying to wrap TextIOWrapper around the file-like objects returned by urllib.request.urlopen as well.

    In the big picture, some discussion of what it means to be "file-like" might be in order. If something is "file-like" and binary, should that always imply that I be able to wrap a TextIOWrapper object around it in order to encode/decode text? I would argue "yes", but I'd be curious to know what others think.

    @vstinner
    Copy link
    Member

    What is the problem with Python 3.2? It works correctly here:

    $ cat bla.txt
    bli
    blo
    bla
    $ gzip bla.txt 
    $ ./python 
    Python 3.3a0 (unknown, Feb 23 2011, 13:03:50) 
    >>> import gzip, io
    >>> f = io.TextIOWrapper(gzip.open("bla.txt.gz"),encoding='ascii')
    >>> f.read()
    'bli\nblo\nbla\n'

    If someone added Python 3.2 in the Versions field because of an issue with bz2: please open a new issue instead.

    @dabeaz
    Copy link
    Mannequin Author

    dabeaz mannequin commented Feb 23, 2011

    Python 3.2 (r32:88445, Feb 20 2011, 21:51:21) 
    [GCC 4.2.1 (Apple Inc. build 5664)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import gzip
    >>> import io
    >>> f = io.TextIOWrapper(gzip.open("file.gz"),encoding='latin-1')
    >>> f.readline()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    io.UnsupportedOperation: read1
    >>>

    @bitdancer
    Copy link
    Member

    Yes, a clear definition of the minimum requirements for being wrapped by TextIOWrapper sounds like a necessary thing to have (and I'd be inclined to agree with your assertion, but I didn't work on the IO library :). It would be best to open a new issue for that.

    @vstinner
    Copy link
    Member

    Yes, a clear definition of the minimum requirements for being wrapped
    by TextIOWrapper sounds like a necessary thing to have

    About that: is read1() argument mandatory or not?

    In _pyio, BufferedIOBase.read1() argument is optional (default: None); BytesIO.read1(), BufferedReader.read1(), BufferedRWPair.read1(), BufferedRandom.read1() argument is mandatory.

    In _io, BufferedIOBase.read1() raises directly an exception, without checking the arguments; BufferedReader.read1() argument is mandatory.

    In the io doc, BufferedIOBase.read1() argument is optional (default: -1), BytesIO.read1() has no argument (!) and BufferedReader.read1() argument is mandatory.

    @pitrou
    Copy link
    Member

    pitrou commented Feb 23, 2011

    It would probably be ok to fallback on read() when read1() isn't implemented. read1() is supposed to be implemented by all BufferedIO-compliant classes, but in all honesty I don't think it's very useful in practice. It's supposed to be an optimization, and I think it's a misguided one; the generalized prefetch() primitive I proposed last year would certainly be more useful: see http://mail.python.org/pipermail/python-dev/2010-September/104194.html

    @gruszczy
    Copy link
    Mannequin

    gruszczy mannequin commented Apr 1, 2011

    Is following change in GzipFile class enough:

        def read1(self, n):
            return self.read(n)

    ? This satisfies TextIOWrapper to run readline correctly.

    @nadeemvawda
    Copy link
    Mannequin

    nadeemvawda mannequin commented Apr 3, 2011

    Is following change in GzipFile class enough:

    def read1(self, n):
        return self.read(n)
    

    ? This satisfies TextIOWrapper to run readline correctly.

    Looks good to me.

    By the way, BZ2File now works correctly - the fix for bpo-5863 adds read1().

    @pitrou
    Copy link
    Member

    pitrou commented Apr 3, 2011

    Nadeem Vawda <nadeem.vawda@gmail.com> added the comment:

    > Is following change in GzipFile class enough:
    >
    > def read1(self, n):
    > return self.read(n)
    >
    > ? This satisfies TextIOWrapper to run readline correctly.

    Looks good to me.

    Well, ideally, read1() should satisfy the condition stated in the
    BufferedIOBase documentation - namely, that it issues at most one read()
    call on the underlying stream.

    @nadeemvawda
    Copy link
    Mannequin

    nadeemvawda mannequin commented Apr 4, 2011

    Here's an implementation of read1() that satisfies that condition, along with
    some relevant unit tests.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 4, 2011

    Here's an implementation of read1() that satisfies that condition, along with
    some relevant unit tests.

    Something looks fishy: what happens if size is -1 and EOFError is not
    raised?

    @nadeemvawda
    Copy link
    Mannequin

    nadeemvawda mannequin commented Apr 4, 2011

    Something looks fishy: what happens if size is -1 and EOFError is not raised?

    You're right - I missed that possibility. In that case, extrasize and offset get
    updated incorrectly, which will break subsequent calls to seek() and tell().
    However, it seems that subsequent reads work fine, because slicing a bytes object
    with a too-large upper bound doesn't raise an exception.

    The attached patch fixes this bug, and updates test_read1() to catch regressions.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Apr 4, 2011

    New changeset 9775d67c9af9 by Antoine Pitrou in branch 'default':
    Issue bpo-10791: Implement missing method GzipFile.read1(), allowing GzipFile
    http://hg.python.org/cpython/rev/9775d67c9af9

    @pitrou
    Copy link
    Member

    pitrou commented Apr 4, 2011

    Patch now committed, thank you!
    Since the patch adds a new API (GzipFile.read1()), I think it's better not to backport it.

    @pitrou pitrou closed this as completed Apr 4, 2011
    @pitrou pitrou added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Apr 4, 2011
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants