Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gzip.open breaks with 'U' flag #49398

Closed
ChrisBarker mannequin opened this issue Feb 4, 2009 · 8 comments
Closed

gzip.open breaks with 'U' flag #49398

ChrisBarker mannequin opened this issue Feb 4, 2009 · 8 comments
Labels
easy stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@ChrisBarker
Copy link
Mannequin

ChrisBarker mannequin commented Feb 4, 2009

BPO 5148
Nosy @jackdied
Files
  • gzipU.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2012-10-21.16:34:00.295>
    created_at = <Date 2009-02-04.01:05:41.172>
    labels = ['easy', 'type-bug', 'library']
    title = "gzip.open breaks with  'U' flag"
    updated_at = <Date 2012-10-21.16:34:00.293>
    user = 'https://bugs.python.org/ChrisBarker'

    bugs.python.org fields:

    activity = <Date 2012-10-21.16:34:00.293>
    actor = 'nadeem.vawda'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-10-21.16:34:00.295>
    closer = 'nadeem.vawda'
    components = ['Library (Lib)']
    creation = <Date 2009-02-04.01:05:41.172>
    creator = 'Chris.Barker'
    dependencies = []
    files = ['12945']
    hgrepos = []
    issue_num = 5148
    keywords = ['patch', 'easy', 'needs review']
    message_count = 8.0
    messages = ['81121', '81158', '81185', '81954', '84207', '92007', '173460', '173462']
    nosy_count = 6.0
    nosy_names = ['jackdied', 'nadeem.vawda', 'Chris.Barker', 'radek768', 'agillesp', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue5148'
    versions = ['Python 2.7']

    @ChrisBarker
    Copy link
    Mannequin Author

    ChrisBarker mannequin commented Feb 4, 2009

    If you pass the 'U' (Universal newlines) flag into gzip.open(), the flag
    gets passed into the file open command used to open the gzip file
    itself. As the 'U' flag can cause changes in teh data (Lineffed
    translation), when it is used with a binary file open, the data is
    corrupted, and all can go to heck.

    In virtually all of my code that reads text files, I use the 'U' flag to
    open files, it really helps not having to deal with newline issues. Yes,
    they are fewer now that the Macintosh uses \n, but they can still be a pain.

    Anyway, we added such support to some matplotlib methods, and found that
    gzip file reading broken We were passing the flags though into either
    file() or gzip.open(), and passing 'U' into gzip.open() turns out to be
    fatal.

    1. It would be nice if the gzip module (and the zip lib module)
      supported Universal newlines -- you could read a compressed text file
      with "wrong" newlines, and have them handled properly. However, that may
      be hard to do, so at least:

    2. Passing a 'U' flag in to gzip.open shouldn't break it -- it shuld be
      ignored or raise an exeption.

    I took a look at the Python SVN (2.5.4 and 2.6.1) for the gzip lib. I
    see this:

            # guarantee the file is opened in binary mode on platforms
            # that care about that sort of thing
            if mode and 'b' not in mode:
                mode += 'b'
            if fileobj is None:
                fileobj = self.myfileobj = __builtin__.open(filename, mode
    or 'rb')

    this is going to break for 'U' == you'll get 'rUb'. I tested
    file(filename, 'rUb'), and it looks like it does universal newline
    translation.

    So:

    • Either gzip should be a bit smarter, and remove the 'U' flag (that's
      what we did in the MPL code), or force 'rb' or 'wb'.

    • Or: file opening should be a bit smarter -- what does 'rUb' mean? a
      file can't be both Binary and Universal Text. Should it raise an
      exception? Somehow I think it would be better to ignore the 'U', but
      maybe that's only because of the issue I happen to be looking at now.

    That later seems a better idea -- this issue could certainly come up in
    other places than the gzip module, but maybe it would break a bunch of
    code -- who knows?

    I haven't touched py3 yet, so I have not idea if this issue is different
    there.

    NOTE: passing in the 'U' flag doesn't guarantee that gzi will break. The
    right combination of bytes needs to be there. In fact, when I first
    tested this with a small test file, it worked just fine -- I though gzip
    was ignoring the flag. However, when tested with a larger (real) gz
    file, it did break.

    very simple patch:

    Add:

    mode.replace('U', '')

    to the above code before opeing the file

    But we may want to do something smarter...

    see the (limited) discussion at:

    http://mail.python.org/pipermail/python-dev/2009-January/085662.html

    @ChrisBarker ChrisBarker mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Feb 4, 2009
    @smontanaro
    Copy link
    Contributor

    Seems like this should be fairly easy to do right. 'U' needs to be
    removed from the flags but then applied to the lines read from the
    stream.

    @smontanaro smontanaro added the easy label Feb 4, 2009
    @smontanaro
    Copy link
    Contributor

    Here's a patch against trunk. Extra test case and minor doc tweak
    included.

    @radek768
    Copy link
    Mannequin

    radek768 mannequin commented Feb 13, 2009

    Same bug in 2.5, I don't know if the patch applies to 2.5

    @jackdied
    Copy link
    Contributor

    Unfortunately universal newlines are more complicated than replace() can
    handle. See io.py, you may be able to use one of those classes to the
    the universal new line handling on the cheap (or at least easy).

    @agillesp
    Copy link
    Mannequin

    agillesp mannequin commented Aug 27, 2009

    The problem appears to be that the gzip module simply doesn't support
    universal newlines yet.

    I'm currently working on the zipfile module's universal newline support
    (bpo-6759) so if nobody else is working on this, I'll do it.

    I'm not sure if file object's open() behavior when presented with 'rUb'
    is correct or not.

    >>> f = open("test.txt", "w").write("blah\r\nblah\rblah\nblah\r\n")
    >>> f = open("test.txt", "rUb")
    >>> f.read()
    'blah\nblah\nblah\nblah\n'

    Since 'U' and 'b' are conceptually mutually exclusive on platforms where
    'b' matters, I can see this being confusing.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 21, 2012

    New changeset e647229c422b by Nadeem Vawda in branch '2.7':
    Issue bpo-5148: Ignore 'U' in mode given to gzip.open() and gzip.GzipFile().
    http://hg.python.org/cpython/rev/e647229c422b

    @nadeemvawda
    Copy link
    Mannequin

    nadeemvawda mannequin commented Oct 21, 2012

    The data corruption issue is now fixed in the 2.7 branch.

    In 3.x, using a mode containing 'U' results in an exception rather than silent data corruption. Additionally, gzip.open() has supported text modes ("rt"/"wt"/"at") and newline translation since 3.3 [bpo-13989].

    @nadeemvawda nadeemvawda mannequin closed this as completed Oct 21, 2012
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    easy stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants