classification
Title: fileinput.hook_compressed returning bytes from gz file
Type: behavior Stage: needs patch
Components: Library (Lib) Versions: Python 3.2, Python 3.3
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: eric.araujo Nosy List: amaury.forgeotdarc, eric.araujo, mnewman, r.david.murray, tati_alchueyr, vstinner
Priority: normal Keywords: easy, patch

Created on 2009-04-14 23:48 by mnewman, last changed 2012-03-13 15:20 by eric.araujo.

Files
File name Uploaded Description Edit
example.zip mnewman, 2009-04-14 23:47 ZIP file containing test example
issue5758_fileinput_gzip_with_encoding.patch tati_alchueyr, 2012-03-13 00:10 Fixed issue #5758. fileinput.hook_compressed returns string (and not bytes) from gz file again. review
issue5758_fileinput_gzip_with_encoding_v2.patch tati_alchueyr, 2012-03-13 07:36 New version of the patch after reviewer's considerations review
Messages (15)
msg85978 - (view) Author: Michael Newman (mnewman) Date: 2009-04-14 23:47
The attached ZIP file contains "test.bat" which runs "test.py" with
Python 2.6 and Python 3.0.

Python 2.6 behaves as expected (see "py26.out"), since it returns
strings from both "mike.txt" and "mike.txt.gz". However, the same test
with Python 3.0 returns bytes from "mike.txt.gz", as shown in "py30.out":
Output: Hello from Mike.
Output: This is the second line.
Output: Why did the robot cross the road?
Output: b'Hello from Mike.'
Output: b'This is the second line.'
Output: b'Why did the robot cross the road?'

For reference, I tested this on Python versions:
Python 2.6.1 (r261:67517, Dec  4 2008, 16:51:00) [MSC v.1500 32 bit
(Intel)] on win32
Python 3.0.1 (r301:69561, Feb 13 2009, 20:04:18) [MSC v.1500 32 bit
(Intel)] on win32
msg111063 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-07-21 14:23
gzip.open() only implements the "rb" mode, and returns bytes.
fileinput could certainly wrap it with a io.TextIOWrapper.

Then the encoding issue arises.
fileinput.FileInput should grow an "encoding" parameter instead of always relying on the default encoding.
msg155471 - (view) Author: Tatiana Al-Chueyr (tati_alchueyr) * Date: 2012-03-12 20:09
Working on this
msg155532 - (view) Author: Tatiana Al-Chueyr (tati_alchueyr) * Date: 2012-03-13 00:10
I had a lot of help of R. David Murray, could someone review our patch, please?

We are specially concerned about backward compatibility.
msg155538 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-03-13 00:30
Thanks for the patch.  I commented on not-so-important things on the code review site (which should have sent an email to Tatiana) and wanted to raise an issue here.

Is it best that the encoding used defaults to Python’s default encoding (UTF-8) or the I/O default encoding (depends on the OS and locale), like what open does?
msg155540 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-03-13 00:34
I think it should use the same default as open, but frankly I couldn't remember the right way to achieve that.  Suggestions welcome :)
msg155543 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-03-13 00:40
We can use locale.getpreferredencoding, or be smarter and just pass down encoding=encoding to TextIOWrapper and let *it* call getpreferredencoding if it’s given a None :)
msg155548 - (view) Author: Tatiana Al-Chueyr (tati_alchueyr) * Date: 2012-03-13 01:03
Excellent! Thanks for all remarks, Éric! I'll follow your suggestions and will submit a new patch.
msg155553 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-03-13 01:26
Duh.  I should have remembered that the io module used None to represent the default encoding.  Thanks, Éric.
msg155554 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-03-13 01:27
Nobody can remember everything: I tried in a shell and looked at the docs to find it :)
msg155577 - (view) Author: Tatiana Al-Chueyr (tati_alchueyr) * Date: 2012-03-13 07:36
Improved patch, according to eric.araujo's suggestions and mnewman's guidance.
msg155595 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-03-13 12:11
Thanks.  Unless another core dev wants to do a complementary review I will slightly tweak the patch and commit it.  I need to finish waking up and eat some food before I do that :)

Technically adding a new argument means that this is a new feature and cannot be applied to the stable 3.2 version, but something needs to be done for this bug in 3.2 too, like a recipe in the docs for a hook_compressed that returns strings (i.e. a function that calls fileinput.hook_compressed and wraps it in a TextIOWrapper), or at least a note to warn about this bug.
msg155618 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-03-13 14:41
For 3.2 could we use the same fix, but without exposing the ability to *change* the encoding?  That is, we use TextIOWrapper but always with the default None for encoding.

It also occurs to me that this really exposes a weakness in the design.  What if the user wants to specify other open parameters?  I wonder if we should say that for better future-proofing openhooks should always take **kw.  You could even envision fileinput accepting **kw and passing them along to the openhook.  I think charset is the most important open paramenter in this context, though, so I don't think we have to solve the general problem in this fix.
msg155622 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-03-13 15:11
It also occurs to me that this fix makes the charset hook look rather odd.  We could render it redundant by passing charset to open in the non-openhook case, and mark it deprecated.

There is also a bug in the hook_encoding docs.  It says the file is opened with codecs.open, but that is not the case, regular open is used.
msg155624 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-03-13 15:20
> For 3.2 could we use the same fix, but without exposing the ability to *change* the encoding?
> That is, we use TextIOWrapper but always with the default None for encoding.
Yes!

> It also occurs to me that this really exposes a weakness in the design.  What if the user wants to
> specify other open parameters?  I wonder if we should say that for better future-proofing openhooks
> should always take **kw.  You could even envision fileinput accepting **kw and passing them along
> to the openhook.  I think charset is the most important open paramenter in this context, though, so
> I don't think we have to solve the general problem in this fix.
I concur.  I’ve never had to care about buffering for example, but mode is another parameter of open that people may want to give.  I’ll commit the minimal fix to 3.2 and merge in 3.3, and then we can discuss on a new RFE bug about adding encoding vs. **kwargs for 3.3.

Agreed on deprecating the charset hook when it becomes redundant.

Will fix the doc bug about codecs.open too.
History
Date User Action Args
2012-03-13 15:20:46eric.araujosetmessages: + msg155624
2012-03-13 15:11:24r.david.murraysetmessages: + msg155622
2012-03-13 14:41:45r.david.murraysetmessages: + msg155618
2012-03-13 12:11:43eric.araujosetassignee: eric.araujo
messages: + msg155595
versions: + Python 3.3, - Python 3.1
2012-03-13 07:36:44tati_alchueyrsetfiles: + issue5758_fileinput_gzip_with_encoding_v2.patch

messages: + msg155577
2012-03-13 01:27:17eric.araujosetmessages: + msg155554
2012-03-13 01:26:25r.david.murraysetmessages: + msg155553
2012-03-13 01:03:13tati_alchueyrsetmessages: + msg155548
2012-03-13 00:40:43eric.araujosetmessages: + msg155543
2012-03-13 00:34:20r.david.murraysetnosy: + r.david.murray
messages: + msg155540
2012-03-13 00:30:35eric.araujosetnosy: + eric.araujo, vstinner
messages: + msg155538
2012-03-13 00:10:05tati_alchueyrsetfiles: + issue5758_fileinput_gzip_with_encoding.patch
keywords: + patch
messages: + msg155532
2012-03-12 20:09:15tati_alchueyrsetnosy: + tati_alchueyr
messages: + msg155471
2010-07-21 14:23:43amaury.forgeotdarcsetversions: - Python 2.7
nosy: + amaury.forgeotdarc

messages: + msg111063

keywords: + easy
2010-07-10 16:20:54BreamoreBoysetstage: needs patch
versions: + Python 3.1, Python 2.7, Python 3.2, - Python 3.0
2009-04-14 23:48:01mnewmancreate