New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More flexibility in zipfile write interface #70227
Comments
I was working recently on some code writing zip files, and needed a bit more flexibility than the interface of zipfile provides. To do what I wanted, I ended up copying the entirety of ZipFile.write() into my own code with modifications. That's ugly, and to make it worse, the code I copied from the Python 3.4 stdlib didn't work properly with Python 3.5. Specifically, I want to:
I could do both by using ZipFile.writestr(), but then the entire file must be loaded into memory at once, which I would much rather avoid. The patch attached is one proposal which would make it easier to do what I want, but it's intended as a starting point for discussion. I'm not particularly attached to the API.
|
from the last blog post of Brett Cannon, I would say that he might be interested in Participating to this discussion (Rewrite zipimport from scratch section) Though I'm not sure about the etiquette to ping someone on such an issue. |
bpo-11980 has a patch that adds a method for writing a file into the ZIP archive. But more flexible way is to add the support of the "w" mode to ZipFile.open(). This will allow to write a data that doesn't exist in memory at once, nor in disk, but is loaded by chunks from the pipe, or from the socket, or from other archive, or just generated on fly. I'm worked on this, but this feature needed to add support of some other features. For example writing to unseekable file (already implemented) or writing and reading a file without known size (not implemented still). bpo-18595 has a proposition to add special methods for creating symlinks, directories etc. |
Attached is a first go at a patch enabling zipfile.open(blah, mode='w') Files must be written sequentially, so you have to close one writing handle before opening another. If you try to open a second one before closing the first, it will raise RuntimeError. I considered doing something where it would write to temporary files and add them to the zip file when they were closed, but it seemed like a bad idea. You can almost certainly break this by reading from a zip file while there's an open writing handle. Resolving this is tricky because there's a disconnect in the requirements for reading and writing: writing allows for a non-seekable output stream, but reading assumes that you can seek freely. The simplest fix is to block reading while there is an open file handle. I don't think many people will need to read one file from a zip while writing another, anyway. I have used the lock, but I haven't thought carefully about thread safety, so that should still be checked carefully. |
zipinfo-from-file.patch has an orthogonal but related change: the code in ZipFile.write() to construct a ZipInfo object from a filesystem file is pulled out to a classmethod ZipInfo.from_file(). Together, these changes make it much easier to control how a file is written to a zip file, like this: zi = ZipInfo.from_file(blah)
# ... manipulate zi...
with open(blah, 'rb') as src, zf.open(zi, 'w') as dest:
# copy of the read/write loop - maybe this should be
# pulled out separately as well? If these changes make it in, I might put a backported copy of the module on PyPI so I can start using it without waiting for Python 3.6. |
Serhiy, any chance you'd have some time to review my patch(es)? Or is there someone else interested in zipfile I might interest? Thanks :-) |
Your patches are great Thomas! This is just what I want to implement. It will take some time to make a careful review. Besides possible corrections I think these features will be added in 3.6. But new features need tests and documenting. |
Thanks! I will work on docs and tests. |
The '2' versions of the two different patches include some docs and tests for these new features. |
I left some review comments. Since the rules for read-only and write-only access are so different, would it make more sense to have a separate method? Maybe your writefile() name, or open_write() or something? I am not too fussed, but unless there is a chance of being able to open a read-write random access file, I find these split-personality open() methods a bit “un-Pythonic”. On the other hand, I guess it is superficially consistent with other open() functions. Also, perhaps if you guaranteed the write-only option returned a file-like object, you could use shutil.copyfileobj() rather than a custom read-write loop. |
Here's a new version of the zf.open() patch following Martin's review (thanks Martin!). I agree that it feels a bit awkward having two completely different actions for zf.open(), but it is a familiar interface, and since the mode parameter is already there, it requires a minimum of new public API. But I'm happy to add a new method like open_write() or write_handle() if people prefer that. The comments on the other patch are minimal, I'll put a new version of that together as well. |
Thanks Serhiy for review comments. |
Updated version of the ZipInfo.from_file() patch attached. |
Thanks Martin for more review; updated open-to-write patch attached. |
Is there anything more I should be doing with either of these patches? I think I've incorporated all review comments I've seen. Thanks! |
New changeset 7fea2cebc604 by Serhiy Storchaka in branch 'default': |
Committed zipinfo-from-file5.patch. Now I'm starting to review zipfile-open-w4.patch (I concurred with most Martin's comments for previous patches). |
Thanks Serhiy! I'll keep an eye out for comments on the other patch. |
Hi Serhiy, any more comments on the zf.open() patch? |
Ping! ;-) |
Please ha Look on bpo-11980 http://bugs.python.org/issue11980 |
Thanks for the pointer Dhiraj. I prefer the open(mode="w") version proposed here, as being more flexible. This way you could wrap the writer object in e.g. TextIOWrapper. The other patch requires passing in a file reader object. Having another look at zipfile-open-w4.patch, I have some thoughts about locking and the writing-while-reading restriction: The lock seems to be designed to serialize reads and writes (which operate on the common underlying file object). See revision 4973ccd46e32, and <https://bugs.python.org/issue14099#msg234142\>, although it would be good to document this, or at the minimum add a comment explaining the purpose and scope of the lock. Currently, it appears that write() and writestr() acquire the lock, so I presume it is intended that these methods can be called multiple times concurrently, and also while the zip file is being read. With the patch, writestr() still preserves the lock usage, but write() does not because it is now implemented in terms of the new open(mode="w") method. I think it would be good to clarify that the lock does _not_ protect concurrent writes via open(mode="w"). |
My initial patch would have allowed passing a readable file-like object into zipfile. I was persuaded that allowing ZipFile.open() to return a writable object was a more intuitive and flexible API. Concurrent writes with zf.open(mode='w') should be impossible, because it only allows one open handle at a time. It still uses the lock inside _ZipWriteFile, so concurrent writes to a single handle should be serialised. I would not recommend anyone try to do concurrent access to a single ZipFile object from multiple threads or coroutines. It's quite stateful, there is no mention of concurrency in the docs, and no tests I can see that try concurrent access. The only thing that might be safe is reading from two files inside the zip file (which shouldn't be changed by this), but I wouldn't want to guarantee even that. |
Oh, I see test_interleaved now, which does test overlapping reads of two files from the same zip file. Do you want that clarified in the docs - which don't currently mention the lock at all - or in a comment in the module? |
When you say concurrent writes should be impossible, I guess that only applies to a single-threaded program. There is no lock protecting the “self._fileRefCnt > 1” check and related manipulation (not that I am saying there should be). For serializing concurrent writes to a single handle, if that is intended it should be documented. If it is not intended, maybe it should be removed (my preference)? It would be good to wait if Serhiy can explain the purpose of the lock, seeing he was involved in adding it and probably knows a lot more about this module than I. |
Serhiy, have you had a chance to look at what the zf.open(mode='w') patch does with the lock? |
Sorry for the delay Thomas. This is complex and important to me issue and I want to be attentive to it. I think we should preserve long existing behavior even if it is not documented (documenting or deprecating it is other issue). Concurrent reading and wring with concurrent reading always (at least since adding ZipFile.open()) worked, but was never documented nor tested. Concurrent writing was added rather as a side effect of bpo-14099. If there is a benefit from getting rid of it, we can break it. For preserving current behavior ZipFile.open(mode='w') should acquire the lock and it should be released in _ZipWriteFile.close(). I have added other comments on Rietveld. The patch needs to be updated to resolve conflicts with committed zipinfo-from-file5.patch. |
https://github.com/SpiderOak/ZipStream tries to implement streaming in one more kind. Also libarchive have experimental support of streaming write to zip archives in newest version. So problem is actual. |
Sorry for the delay, this fell off my radar because emails from both the bug tracker and Rietveld tend to fall foul of my spam filters, so I have to go and check. I have implemented Serhiy's suggestions, but there turns out to be a test (test_write_after_read) that calls writestr while a reading handle is open, and it now fails. This is absolutely expected according to the design we discussed, but if there's a test, I guess that means we need to keep that functionality working? In principle, the only thing that's not possible is interleaving writes to multiple files within the zip - because we don't know where to start writing the second file. We should be able to have one writer and n readers going at once, but every time I start looking into that I get mired in complexity. Maybe (hopefully) there's some critical abstraction that hasn't occurred to me. |
Martin, Serhiy: Am I right that it would be unacceptable to change the test, and I therefore need to find a way to allow writestr while a reading-mode handle is open? |
test_write_after_read was added in bpo-14099. It doesn't test intended behavior, it tests existing behavior just in case to not change it unintentionally. It is desirable to preserve it, but if there is no simple way, may be we can change it. |
It wasn't as bad as I thought (hopefully that doesn't mean I've missed something). zipfile-open-w6.patch allows interleaved reads and writes so long as the underlying file object is seekable. You still can't have multiple write handles open (it wouldn't know where to start the second file), or read & write handles on a non-seekable stream (I don't think a non-seekable read-write file ever makes sense, except perhaps for a pty, which is really two separate streams on one file descriptor). Tests are passing again. |
I understand Python’s zipfile module does not allow reading unless seek() is supported (this was one of Марк’s complaints). So it is irrelevant whether we are writing. I left a few comments, but it sounds like it is valid read and write at the same time. |
zipfile-open-w7 adds a test that Martin requested |
zipfile-open-w8 removes the concurrent read-write check for non-seekable files. As Martin points out, reading currently requires seeking, and a streaming read API would look very different from the current API. |
Any further review comments on this? |
Ping? Once this is landed, I intend to make a backport package of it so I can start using it before Python 3.6 comes out. |
I have added comments for tests and documentation. The implementation looks correct. But I prefer simpler solution on this stage. Here is something between zipfile-open-w5.patch and zipfile-open-w8.patch, close to zipfile-open-w5.patch with fixed issues.
This works in all cases when current code works. The ability to read while there is open write handler is separate new feature (and I don't know whether there is a need in it). |
Thanks Serhiy. I'm quite happy with that set of constraints. I had a look over your patch set, and just spotted one thing in writestr that I have overlooked all along. |
Thank you for your contribution Thomas. I'll push the patch with fixed writestr in short time unless Martin left comments. |
Yes this looks okay to me, so go ahead. My only concern was other methods affecting the file position while in the middle of writing, but it looks like those methods maintain separate objects like self.filelist, so there is no conflict. |
New changeset 175c3f1d0256 by Serhiy Storchaka in branch 'default': |
Thanks Serhiy! I am marking this as fixed. |
Let me know when you create a backport package Thomas. I'd like to help. |
There are yet few issues.
|
If I was writing it anew, I would use ValueError for something like an invalid mode parameter: zf.open('foo', mode='q') . That particular case was already in the code, though, so I think backwards compatibility should take precedence. 2 & 3. Agreed, I'll prepare a patch |
Patch attached for points 2 and 3 |
The bonus patch looks okay, although I wonder if the directory slash (/) information should be in the RST rather than doc string. Usually the RST has all the details, and doc strings are just summaries. Regarding exceptions, I can sympathise with both sides of the argument and don’t have a strong opinion (why does the exception type matter for programmer errors anyway?). But I think it might be better to be locally consistent within the zipfile module, and the module is already heavily documented with RuntimeError for similar programmer errors. |
Agreed that the docstring should be shorten version of the documentation and not vice versa. The directory slash information is implementation detail. I'm not sure it is needed in the documentation. |
Updated version of the patch with the detail moved to rst. |
New changeset 29b470b8ab77 by Serhiy Storchaka in branch 'default': |
LGTM. Thank you Thomas. |
The backported package now exists on PyPI as zipfile36. It's passing the tests on 3.4 - I haven't tested further back. I'm trying out Gitlab for hosting it: |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: