Issue 7610: Cannot use both read and readline method in same ZipExtFile object

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/51859

classification

Title:	Cannot use both read and readline method in same ZipExtFile object
Type:	behavior	Stage:	resolved
Components:	Extension Modules	Versions:	Python 3.2, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	pitrou	Nosy List:	ezio.melotti, lucifer, nirai, pitrou, r.david.murray
Priority:	normal	Keywords:	patch

Created on 2009-12-31 08:03 by lucifer, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
zipfile_7610_py27.diff	nirai, 2010-01-03 10:34
zipfile_7610_py27.diff	nirai, 2010-01-04 22:39	Patch for Python 2.7 using the io module.
zipfile_7610_py27.diff	nirai, 2010-01-10 20:38	Updated patch for Python 2.7 using the io module.
zipfile_7610_py27_v4.diff	nirai, 2010-01-14 19:44	Updated patch for Python 2.7 using the io module.
zipfile_7610_py27_v5.diff	nirai, 2010-01-18 21:14	Patch (take 5) for Python 2.7.
zipfile_7610_py27_v6.diff	nirai, 2010-01-19 06:55	Patch v6 for Python 2.7.

Messages (28)
msg97079 - (view)	Author: lucifer (lucifer)	Date: 2009-12-31 08:11
open a file in the zip file through ZipFile.open method, if invoke read method after readline method in the ZipExtFile object, the data is not correct. I was trying to get a ZipExtFile and pass it to pickle.load(f), a exception was thrown. The reason is readline will keep a linebuffer, but read only use readbuffer. After invoke readline, there will be some missing data in linebuffer when invoke read method
msg97158 - (view)	Author: Nir Aides (nirai)	Date: 2010-01-03 10:34
I uploaded a possible patch for Python 2.7. The patch converts ZipExtFile into subclass of io.BufferedIOBase and drops most of the original implementation. However, the patch breaks current newline behavior of ZipExtFile. I figured this was reasonable because zipfile newline behavior: a) was introduced in Python 2.6 b) was incompatible with behavior of io module files. c) was not documented. Reading zip content as text with newline functionality may be done with io.TextIOWrapper. A bonus of sub classing io.BufferedIOBase is significantly better read performance.
msg97165 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-01-03 15:10
I don't think we can remove the "U" option from 2.6/2.7; it was certainly introduced for a reason and isn't inconsistent with the "U" option to the built-in open(). On 3.x, behaviour is indeed inconsistent with the standard IO library, so maybe we can either wrap the object in a TextIOWrapper, or remove support for "U".
msg97230 - (view)	Author: Nir Aides (nirai)	Date: 2010-01-04 22:39
Right, I was reading the 3.1 docs by mistake. I updated the patch. This time universal newlines are supported. On my dataset (75MB 650K lines log file) the readline() speedup is x40 for 'r' mode and x8 for 'rU' mode, and you can get an extra bump by using the io module wrappers. I added tests for interleaved use of readline() and read() (which add ~2 seconds to running time)
msg97235 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-01-04 23:43
> I updated the patch. This time universal newlines are supported. Thank you. Are you sure the "Shortcut common case" in readline() is useful? BufferedIOBase.readline() in itself should be rather fast.
msg97236 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-01-04 23:49
Also, I'm not sure what happens in readline() in universal mode when the chunk ends with a '\r' and there's a '\n' in the following chunk (see the "ugly check" that your patch removes). Is there a test for that?
msg97251 - (view)	Author: Nir Aides (nirai)	Date: 2010-01-05 07:10
> Thank you. Are you sure the "Shortcut common case" in readline() > is useful? BufferedIOBase.readline() in itself should be rather fast. On my dataset the shortcut speeds up readline() 400% on top of the default C implementation. I can take a look to why the C implementation is slow (although it is documented as "slowish"). > Also, I'm not sure what happens in readline() in universal mode when > the chunk ends with a '\r' and there's a '\n' in the following chunk > (see the "ugly check" that your patch removes). Is there a test for that? The regular pattern returns either a line chunk or a newline (sequence) but not both. To read a line there is therefore a minimum of two peek() calls. One for the line content and the last for the newline. Since the peek is called with a value of 2, the newline sequence \r\n should be retrieved as is. There is no test for that.
msg97258 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-01-05 10:58
> Since the peek is called with a value of 2, the newline sequence \r\n > should be retrieved as is. No, it doesn't follow. The \r can still appear at the end of a readahead, in which case your algorithm will not eliminate the following \n. That is, if the sequence of readaheads is ['a\r', '\nb\n'], readlines() will return ['a\n', '\n', 'b\n'] while it should return ['a\n', 'b\n']. It should be possible to construct a statistically valid test case for this, for example by creating an archived file containing 'a\r\n'*10000 and reading back from it.
msg97548 - (view)	Author: Nir Aides (nirai)	Date: 2010-01-10 20:38
If the sequence of readaheads is ['a\r', '\nb\n'], the first use of the pattern will consume 'a', then the peek(2) will trigger a read() and the next use of the pattern will consume '\r\n'. I updated the patch and enhanced a little the inline comments to clear this up and added the suggested test. I also ventured into a rewrite of the read() function since it seems to contain various latent bugs / problems, probably the result of ages of slow refactoring. For example, decompressor flush() depends on self.eof flag which is never set and the decrypter is fed by the raw buffer which is in turn set by the decompresser, thus theoretically feeding decrypted data back to the decrypter. I also made it completely buffered so now small reads are significantly faster. If this patch is good, i'll prepare a patch for python 3.x and 2.6
msg97659 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-01-12 21:34
Some comments: * you should probably write `n = sys.maxsize` instead of `n = 1 << 31 - 1` * ZipExtFile.read() should support `n=None` as a synonym to `n=-1` (read everything) * `bytes` as a variable name isn't very good since it's the built-in name for bytestrings in py3k * in ZipExtFile.read(), it seems you have removed the adjustment for encrypted files (see `adjust read size for encrypted files since the first 12 bytes [etc.]`) * is there a situation where the decompressor might return less bytes than expected? (after all compression doesn't /always/ compress, in unfavourable chunks of data it might actually expand things a bit)
msg97783 - (view)	Author: Nir Aides (nirai)	Date: 2010-01-14 19:44
I uploaded an update for Python 2.7. > * you should probably write `n = sys.maxsize` instead of `n = 1 << 31 - 1` sys.maxsize is 64 bit number on my system but the maximum value accepted by zlib's decompress() seems to be INT_MAX defined in pyport.h which equals the number I used. > * ZipExtFile.read() should support `n=None` as a synonym to `n=-1` > (read everything) Added > * `bytes` as a variable name isn't very good since it's the built-in > name for bytestrings in py3k Changed (old habits die hard). > * in ZipExtFile.read(), it seems you have removed the adjustment for > encrypted files (see `adjust read size for encrypted files since the > first 12 bytes [etc.]`) Yes, was moved to the constructor. > * is there a situation where the decompressor might return less bytes > than expected? (after all compression doesn't /always/ compress, in > unfavourable chunks of data it might actually expand things a bit) The documentation of io.BufferedIOBase.read() reads "multiple raw reads may be issued to satisfy the byte count". I understood this language to mean satisfying read size is optional. Isn't it?
msg97830 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-01-15 18:38
> The documentation of io.BufferedIOBase.read() reads "multiple raw > reads may be issued to satisfy the byte count". I understood this > language to mean satisfying read size is optional. Isn't it? It's the reverse actually. It means that `BufferedIOBase.read` itself may (or perhaps should) issue multiple raw reads in order to satisfy the byte count.
msg97838 - (view)	Author: Nir Aides (nirai)	Date: 2010-01-15 19:27
May be a good idea to clear this up in the documentation. http://en.wiktionary.org/wiki/may#Verb "(modal auxiliary verb, defective) To have permission to. Used in granting permission and in questions to make polite requests."
msg97848 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-01-15 21:18
I do not find the existing phrasing in the IO docs ambiguous, but since it is obviously possible to misinterpret it it would be good to clarify it. Can you suggest an alternate phrasing that would be clearer?
msg97974 - (view)	Author: Nir Aides (nirai)	Date: 2010-01-17 21:13
> I do not find the existing phrasing in the IO docs ambiguous, but since > it is obviously possible to misinterpret it it would be good to clarify > it. Can you suggest an alternate phrasing that would be clearer? Replace 'may' with 'will' or 'shall' everywhere the context indicates a mandatory requirement. Since this possibly affects the entire Python documentation, does it make sense to discuss this on python-dev?
msg97977 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-01-17 21:29
> Replace 'may' with 'will' or 'shall' everywhere the context indicates > a mandatory requirement. > > Since this possibly affects the entire Python documentation, does it > make sense to discuss this on python-dev? Either that, or open a separate tracker issue. I'm not a documentation specialist and would prefer to gather other opinions.
msg98033 - (view)	Author: Nir Aides (nirai)	Date: 2010-01-18 21:14
Uploaded an updated patch with read() which calls underlying stream enough times to satisfy required read size.
msg98040 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-01-18 23:39
The patch looks rather good. Is `self.MAX_N` still necessary in read()? I guess it's rare to read more than 2GB at once, though...
msg98046 - (view)	Author: Nir Aides (nirai)	Date: 2010-01-19 06:55
Right, removed MAX_N from read(); remains in read1(). If good, what versions of Python is this patch desired for?
msg98449 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-01-27 21:20
The patch has been committed in r77798 (trunk) and r77800 (py3k). Thank you! I won't commit it to 2.6 and 3.1 because it's too involved to qualify as a bug fix, though.
msg98455 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-01-28 01:51
Nir, could you also provide a test for the part that "handles unconsumed data" (line 601 in zipfile.py)? In r77809 (and r77810) I made a change to avoid using zlib when it's not necessary (zlib is not always available), and I was going to commit a typo that luckily I noticed in the diff. The tests however didn't notice anything because that part is untested.
msg98459 - (view)	Author: Nir Aides (nirai)	Date: 2010-01-28 06:55
The related scenario is a system without zlib. How do you suggest simulating this in test?
msg98460 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-01-28 07:07
The test should just check that the part that handles unconsumed data works when zlib is available. AFAIU if zlib is not available this part (i.e. the content of the if) can be skipped so it doesn't need to be tested. (When zlib is not available 'zlib' is set to None, see the try/except at the beginning of zipfile.py)
msg98461 - (view)	Author: Nir Aides (nirai)	Date: 2010-01-28 07:35
Unconsumed data is compressed data. If the part which handles unconsumed data does not work when zlib is available, then the existing tests would fail. In any case the unconsumed buffer is an implementation detail of zipfile. I see a point in adding a test to make sure zipfile behaves as expected when zlib is not available, but how? Also, on which systems is zlib missing? I don't see this mentioned in the zlib docs.
msg98462 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-01-28 07:56
> If the part which handles unconsumed data does not work when zlib is available, then the existing tests would fail. If the existing tests end up testing that part of code too then it's probably fine. I tried to add a print inside the 'if' (at line 605) but it didn't print anything. However here some tests are skipped, so maybe that part of code is tested in some of these skipped tests. > I see a point in adding a test to make sure zipfile behaves as expected when zlib is not available, but how? Several tests are already skipped in test_zipfile.py when zlib is not available (all the ones with the skipUnless decorator) > Also, on which systems is zlib missing? I don't see this mentioned in the zlib docs. Even if it's available and usually already installed in most of the systems, in some -- like mine (Linux) -- is not installed. If you can confirm that that part of code is already tested in some of the tests that here are skipped, I will close the issue.
msg98471 - (view)	Author: Nir Aides (nirai)	Date: 2010-01-28 18:22
I actually meant how would you simulate zlib's absence on a system in which it is present?
msg98473 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-01-28 18:31
The easiest way is to setting zlib to None or not import it at all. Are you suggesting that test_zipfile should be always run with and without zlib to check that everything (except the things that require zlib of course) works in both the cases? My concern was about that part that handles unconsumed data only, because I didn't see any test in the patch that specifically check it. If there are already tests that checks it, then it's fine, even if it's not possible to test that part without zlib.
msg98478 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-01-28 20:43
Apparently that part of code is already tested in other tests that use deflated mode, so I'll close this again. Thanks for the info.

History
Date	User	Action	Args
2022-04-11 14:56:55	admin	set	github: 51859
2010-02-09 16:08:54	amaury.forgeotdarc	link	issue7216 superseder
2010-01-28 20:43:33	ezio.melotti	set	status: open -> closed messages: + msg98478 stage: test needed -> resolved
2010-01-28 18:31:02	ezio.melotti	set	messages: + msg98473
2010-01-28 18:22:26	nirai	set	messages: + msg98471
2010-01-28 07:56:20	ezio.melotti	set	messages: + msg98462
2010-01-28 07:35:57	nirai	set	messages: + msg98461
2010-01-28 07:07:24	ezio.melotti	set	messages: + msg98460
2010-01-28 06:55:29	nirai	set	messages: + msg98459
2010-01-28 01:51:45	ezio.melotti	set	status: closed -> open nosy: + ezio.melotti messages: + msg98455 stage: resolved -> test needed
2010-01-27 21:20:16	pitrou	set	status: open -> closed resolution: accepted -> fixed messages: + msg98449 stage: patch review -> resolved
2010-01-27 20:50:00	pitrou	set	assignee: pitrou resolution: accepted versions: + Python 2.7
2010-01-19 06:55:40	nirai	set	files: + zipfile_7610_py27_v6.diff messages: + msg98046
2010-01-18 23:39:27	pitrou	set	messages: + msg98040
2010-01-18 21:14:26	nirai	set	files: + zipfile_7610_py27_v5.diff messages: + msg98033
2010-01-17 21:29:52	pitrou	set	messages: + msg97977
2010-01-17 21:13:22	nirai	set	messages: + msg97974
2010-01-15 21:18:19	r.david.murray	set	nosy: + r.david.murray messages: + msg97848
2010-01-15 19:27:05	nirai	set	messages: + msg97838
2010-01-15 18:38:26	pitrou	set	messages: + msg97830
2010-01-14 19:44:18	nirai	set	files: + zipfile_7610_py27_v4.diff messages: + msg97783
2010-01-12 21:34:34	pitrou	set	messages: + msg97659
2010-01-10 20:38:37	nirai	set	files: + zipfile_7610_py27.diff messages: + msg97548
2010-01-05 10:58:26	pitrou	set	messages: + msg97258
2010-01-05 07:10:37	nirai	set	messages: + msg97251
2010-01-04 23:49:38	pitrou	set	messages: + msg97236
2010-01-04 23:43:57	pitrou	set	messages: + msg97235
2010-01-04 22:39:33	nirai	set	files: + zipfile_7610_py27.diff messages: + msg97230
2010-01-03 15:10:11	pitrou	set	priority: normal versions: + Python 3.2, - Python 2.6 nosy: + pitrou messages: + msg97165 stage: patch review
2010-01-03 10:34:05	nirai	set	files: + zipfile_7610_py27.diff keywords: + patch messages: + msg97158
2009-12-31 12:27:32	nirai	set	nosy: + nirai
2009-12-31 08:11:22	lucifer	set	messages: + msg97079
2009-12-31 08:03:35	lucifer	create