Message 81121 - Python tracker



Author	Chris.Barker
Recipients	Chris.Barker
Date	2009-02-04.01:05:36
Message-id	<1233709542.39.0.638969424034.issue5148@psf.upfronthosting.co.za>
In-reply-to

Content

If you pass the 'U' (Universal newlines) flag into gzip.open(), the flag
gets passed into the file open command used to open the gzip file
itself. As the 'U' flag can cause changes in the data (linefeed
translation), when it is used with a binary file open, the data is
corrupted, and all can go to heck.
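
To make the failure concrete, here is a minimal sketch in modern Python 3 (not the Python 2.5/2.6 code this report is about). It assumes that universal-newline translation amounts to rewriting '\r\n' and lone '\r' as '\n', and applies that rewrite to the raw bytes of a gzip stream. The helper names are made up for illustration, and compresslevel=0 is used only so that the payload's '\r\n' bytes are certain to appear in the stream.

```python
import gzip
import zlib


def translate_newlines(raw):
    """Mimic universal-newline translation on a byte string (assumed model)."""
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def try_decompress(raw):
    """Return the decompressed payload, or None if the stream is unreadable."""
    try:
        return gzip.decompress(raw)
    except (OSError, EOFError, zlib.error):
        return None


payload = b"line one\r\nline two\r\n"
# Level 0 stores the payload uncompressed, so its '\r\n' bytes are in the stream.
stream = gzip.compress(payload, compresslevel=0, mtime=0)
damaged = translate_newlines(stream)

print(try_decompress(stream) == payload)   # the untouched stream round-trips
print(try_decompress(damaged) == payload)  # the translated stream does not
```

The translation shortens the stream, so the stored block no longer lines up with its trailer and the original payload cannot be recovered.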

In virtually all of my code that reads text files, I use the 'U' flag to
open files; it really helps not having to deal with newline issues. Yes,
they are fewer now that the Macintosh uses \n, but they can still be a pain.

Anyway, we added such support to some matplotlib methods, and found that
gzip file reading was broken. We were passing the flags through to either
file() or gzip.open(), and passing 'U' into gzip.open() turns out to be
fatal.

1) It would be nice if the gzip module (and the zip lib module)
supported Universal newlines -- you could read a compressed text file
with "wrong" newlines, and have them handled properly. However, that may
be hard to do, so at least:

2) Passing a 'U' flag in to gzip.open shouldn't break it -- it should be
ignored or raise an exception.
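
For item 1, one possible shape of the feature is to apply newline translation to the decompressed text rather than to the compressed bytes. The sketch below does that with io.TextIOWrapper from Python 3's standard library; it illustrates the idea and is not a description of how the gzip module in Python 2.5 or 2.6 behaves. The sample data and the ASCII encoding are assumptions.

```python
import gzip
import io

# Text with mixed line endings: Windows (\r\n), old Mac (\r), and Unix (\n).
original = b"alpha\r\nbeta\rgamma\n"
compressed = gzip.compress(original)

# Open the gzip stream in binary mode, then translate newlines on the
# decompressed side. newline=None turns on universal newlines when reading.
binary_stream = gzip.GzipFile(fileobj=io.BytesIO(compressed), mode="rb")
with io.TextIOWrapper(binary_stream, encoding="ascii", newline=None) as text:
    lines = text.readlines()

print(lines)
```

In Python 3, gzip.open(path, 'rt', newline=None) layers a text wrapper over the binary stream in the same way.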

I took a look at the Python SVN (2.5.4 and 2.6.1) for the gzip lib. I
see this:


        # guarantee the file is opened in binary mode on platforms
        # that care about that sort of thing
        if mode and 'b' not in mode:
            mode += 'b'
        if fileobj is None:
            fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')

This is going to break for 'U' -- you'll get 'rUb'. I tested
file(filename, 'rUb'), and it looks like it does universal newline
translation.
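
The mode handling quoted above can be replayed on its own to see which string reaches the underlying open() call. This is only a sketch of the quoted logic; the function name effective_mode is hypothetical and not part of the gzip module.

```python
def effective_mode(mode):
    """Reproduce the mode handling quoted above (hypothetical helper name)."""
    # Same check as the quoted gzip code: append 'b' if it is missing.
    if mode and 'b' not in mode:
        mode += 'b'
    # Same fallback as the quoted open() call.
    return mode or 'rb'


for requested in ('r', 'rb', 'rU', None):
    print(repr(requested), '->', repr(effective_mode(requested)))
```

Nothing in that logic looks at the 'U', so it is passed along untouched next to the added 'b'.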

So:

* Either gzip should be a bit smarter, and remove the 'U' flag (that's
what we did in the MPL code), or force 'rb' or 'wb'.

* Or: file opening should be a bit smarter -- what does 'rUb' mean? A
file can't be both Binary and Universal Text. Should it raise an
exception? Somehow I think it would be better to ignore the 'U', but
maybe that's only because of the issue I happen to be looking at now.
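
The two behaviours discussed for gzip (drop the 'U' quietly, or refuse it) could both be expressed in one small helper. This is a sketch only: clean_gzip_mode and its strict parameter are hypothetical names, not anything in the gzip module or in the MPL code.

```python
def clean_gzip_mode(mode, strict=False):
    """Return a binary-only mode string for opening the underlying file.

    Hypothetical helper: with strict=False the 'U' flag is dropped silently,
    with strict=True it is rejected with ValueError.
    """
    if mode and 'U' in mode:
        if strict:
            raise ValueError("'U' cannot be combined with binary mode: %r" % mode)
        mode = mode.replace('U', '')  # str.replace returns a new string
    if mode and 'b' not in mode:
        mode += 'b'
    return mode or 'rb'


print(clean_gzip_mode('rU'))
print(clean_gzip_mode('rUb'))
print(clean_gzip_mode(None))
```

Which default is right is exactly the open question above: ignoring the flag keeps existing callers working, while raising makes the mistake visible.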

The latter seems a better idea -- this issue could certainly come up in
other places than the gzip module, but maybe it would break a bunch of
code -- who knows?

I haven't touched py3 yet, so I have no idea if this issue is different
there.


NOTE: passing in the 'U' flag doesn't guarantee that gzip will break. The
right combination of bytes needs to be there. In fact, when I first
tested this with a small test file, it worked just fine -- I thought gzip
was ignoring the flag. However, when tested with a larger (real) gz
file, it did break.
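
A rough back-of-the-envelope model suggests why a small file can slip through. Assume, purely for illustration, that each byte of a compressed stream is uniformly random and independent; then the chance that a stream of n bytes contains no carriage return (0x0D) at all is (255/256)**n. Real gzip streams are not perfectly random, so treat these figures as an order-of-magnitude guide only. Both function names are hypothetical.

```python
def chance_of_no_cr(n_bytes):
    """Probability that n uniformly random bytes contain no 0x0D byte."""
    return (255.0 / 256.0) ** n_bytes


def would_be_altered(raw):
    """True if newline translation would change this byte string."""
    return b"\r" in raw


for size in (100, 1000, 100000):
    print(size, chance_of_no_cr(size))
```

Under that assumption a stream of about a hundred bytes escapes more often than not, while a stream of a hundred kilobytes essentially never does.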

Very simple patch:

Add:

mode = mode.replace('U', '')

to the above code before opening the file.

But we may want to do something smarter...

See the (limited) discussion at:

http://mail.python.org/pipermail/python-dev/2009-January/085662.html

History
Date	User	Action	Args
2009-02-04 01:05:42	Chris.Barker	set	recipients: + Chris.Barker
2009-02-04 01:05:42	Chris.Barker	set	messageid: <1233709542.39.0.638969424034.issue5148@psf.upfronthosting.co.za>
2009-02-04 01:05:41	Chris.Barker	link	issue5148 messages
2009-02-04 01:05:37	Chris.Barker	create