More flexibility in zipfile write interface #70227

takluyver · 2016-01-07T15:10:52Z

BPO	26039
Nosy	@takluyver, @vadmium, @serhiy-storchaka, @Carreau
Files	zipfile-flexibility.patch zipfile-open-w.patch zipinfo-from-file.patch: ZipInfo.from_file() classmethod zipinfo-from-file2.patch: ZipInfo.from_file() + docs + tests zipfile-open-w2.patch: zf.open(... mode='w') + docs + tests zipfile-open-w3.patch: zf.open(... mode='w') after vadmium's review zipinfo-from-file3.patch: ZipInfo.from_file() after vadmium's review zipinfo-from-file4.patch: ZipInfo.from_file() after storchaka's review zipinfo-from-file5.patch: ZipInfo.from_file() after storchaka's 2nd review zipfile-open-w4.patch: zf.open(... mode='w') after vadmium's 2nd review socketpair.py: example of monkey-patching zipfile-open-w5.patch zipfile-open-w6.patch zipfile-open-w7.patch zipfile-open-w8.patch zipfile-open-w9.patch zipfile-open-w-exceptions.patch zipfile-flex-bonus.patch zipfile-flex-bonus2.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/serhiy-storchaka'
closed_at = <Date 2016-05-15.09:28:40.190>
created_at = <Date 2016-01-07.15:10:51.849>
labels = ['type-feature', 'library']
title = 'More flexibility in zipfile write interface'
updated_at = <Date 2016-05-15.17:27:05.746>
user = 'https://github.com/takluyver'

bugs.python.org fields:

activity = <Date 2016-05-15.17:27:05.746>
actor = 'takluyver'
assignee = 'serhiy.storchaka'
closed = True
closed_date = <Date 2016-05-15.09:28:40.190>
closer = 'serhiy.storchaka'
components = ['Library (Lib)']
creation = <Date 2016-01-07.15:10:51.849>
creator = 'takluyver'
dependencies = []
files = ['41525', '41626', '41637', '41722', '41726', '41752', '41753', '41754', '41758', '41759', '42311', '42375', '42588', '42597', '42598', '42819', '42850', '42851', '42855']
hgrepos = []
issue_num = 26039
keywords = ['patch']
message_count = 58.0
messages = ['257694', '257696', '257697', '258313', '258459', '258949', '258958', '258959', '259022', '259184', '259206', '259209', '259255', '259258', '259553', '259805', '259806', '259833', '260305', '260733', '260734', '260746', '260808', '260810', '260852', '261162', '261168', '261328', '261332', '262534', '262537', '262543', '262557', '262926', '264062', '264070', '264169', '264205', '264242', '264248', '264720', '265232', '265330', '265386', '265388', '265459', '265461', '265532', '265540', '265541', '265544', '265546', '265559', '265580', '265592', '265603', '265605', '265635']
nosy_count = 6.0
nosy_names = ['python-dev', 'takluyver', 'martin.panter', 'serhiy.storchaka', 'mbussonn', 'Dhiraj_Mishra']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue26039'
versions = ['Python 3.6']

takluyver · 2016-01-07T15:10:49Z

I was working recently on some code writing zip files, and needed a bit more flexibility than the interface of zipfile provides. To do what I wanted, I ended up copying the entirety of ZipFile.write() into my own code with modifications. That's ugly, and to make it worse, the code I copied from the Python 3.4 stdlib didn't work properly with Python 3.5.

Specifically, I want to:

Override the timestamp, which write() unconditionally takes from the mtime of the file.
Do extra processing on the data (e.g. calculating a checksum) as it is read. Being able to pass a file-like object in to be read, rather than just a filename, would work for this.

I could do both by using ZipFile.writestr(), but then the entire file must be loaded into memory at once, which I would much rather avoid.

The patch attached is one proposal which would make it easier to do what I want, but it's intended as a starting point for discussion. I'm not particularly attached to the API.

Should this be a new method, or new functionality for either the write or writestr method? I favour a new method, because the existing methods are already quite long, and I think it's nicer to break write() up into two parts like this.
If a new method, what should it be called? I opted for writefile()
What should its signature be? It's currently writefile(file_obj, zinfo, force_zip64=False), but I can see an argument for aligning it with writestr(zinfo_or_arcname, data, compress_type=None).
What to do about ZIP64: the code has to decide whether to use ZIP64 format before writing, because it affects the header size, but we may not know the length before we have read it all. I've used a force_zip64 boolean parameter, and documented that people should pass it True if a file of unknown size could exceed 2 GiB.
Are there other points where it could be made more flexible while we're thinking about this? I'd quite like to split out the code for writing a directory entry to a separate method ('writedir'?). I'd also like to allow datetime objects to be passed in for timestamps as well as the 6-tuples currently expected.

Carreau · 2016-01-07T15:53:31Z

from the last blog post of Brett Cannon, I would say that he might be interested in Participating to this discussion (Rewrite zipimport from scratch section)

Though I'm not sure about the etiquette to ping someone on such an issue.

serhiy-storchaka · 2016-01-07T16:07:18Z

bpo-11980 has a patch that adds a method for writing a file into the ZIP archive.

But more flexible way is to add the support of the "w" mode to ZipFile.open(). This will allow to write a data that doesn't exist in memory at once, nor in disk, but is loaded by chunks from the pipe, or from the socket, or from other archive, or just generated on fly. I'm worked on this, but this feature needed to add support of some other features. For example writing to unseekable file (already implemented) or writing and reading a file without known size (not implemented still).

bpo-18595 has a proposition to add special methods for creating symlinks, directories etc.

takluyver · 2016-01-15T17:54:04Z

Attached is a first go at a patch enabling zipfile.open(blah, mode='w')

Files must be written sequentially, so you have to close one writing handle before opening another. If you try to open a second one before closing the first, it will raise RuntimeError. I considered doing something where it would write to temporary files and add them to the zip file when they were closed, but it seemed like a bad idea.

You can almost certainly break this by reading from a zip file while there's an open writing handle. Resolving this is tricky because there's a disconnect in the requirements for reading and writing: writing allows for a non-seekable output stream, but reading assumes that you can seek freely. The simplest fix is to block reading while there is an open file handle. I don't think many people will need to read one file from a zip while writing another, anyway.

I have used the lock, but I haven't thought carefully about thread safety, so that should still be checked carefully.

takluyver · 2016-01-17T10:48:44Z

zipinfo-from-file.patch has an orthogonal but related change: the code in ZipFile.write() to construct a ZipInfo object from a filesystem file is pulled out to a classmethod ZipInfo.from_file().

Together, these changes make it much easier to control how a file is written to a zip file, like this:

zi = ZipInfo.from_file(blah)
# ... manipulate zi...
with open(blah, 'rb') as src, zf.open(zi, 'w') as dest:
    # copy of the read/write loop - maybe this should be
    # pulled out separately as well?

If these changes make it in, I might put a backported copy of the module on PyPI so I can start using it without waiting for Python 3.6.

takluyver · 2016-01-26T11:19:11Z

Serhiy, any chance you'd have some time to review my patch(es)? Or is there someone else interested in zipfile I might interest? Thanks :-)

serhiy-storchaka · 2016-01-26T14:17:39Z

Your patches are great Thomas! This is just what I want to implement. It will take some time to make a careful review. Besides possible corrections I think these features will be added in 3.6. But new features need tests and documenting.

takluyver · 2016-01-26T14:19:07Z

Thanks! I will work on docs and tests.

takluyver · 2016-01-27T14:41:39Z

The '2' versions of the two different patches include some docs and tests for these new features.

vadmium · 2016-01-29T01:48:07Z

I left some review comments.

Since the rules for read-only and write-only access are so different, would it make more sense to have a separate method? Maybe your writefile() name, or open_write() or something? I am not too fussed, but unless there is a chance of being able to open a read-write random access file, I find these split-personality open() methods a bit “un-Pythonic”. On the other hand, I guess it is superficially consistent with other open() functions.

Also, perhaps if you guaranteed the write-only option returned a file-like object, you could use shutil.copyfileobj() rather than a custom read-write loop.

takluyver · 2016-01-29T12:08:50Z

Here's a new version of the zf.open() patch following Martin's review (thanks Martin!).

I agree that it feels a bit awkward having two completely different actions for zf.open(), but it is a familiar interface, and since the mode parameter is already there, it requires a minimum of new public API. But I'm happy to add a new method like open_write() or write_handle() if people prefer that.

The comments on the other patch are minimal, I'll put a new version of that together as well.

takluyver · 2016-01-29T14:10:03Z

Thanks Serhiy for review comments.

takluyver · 2016-01-30T12:18:39Z

Updated version of the ZipInfo.from_file() patch attached.

takluyver · 2016-01-30T15:49:41Z

Thanks Martin for more review; updated open-to-write patch attached.

takluyver · 2016-02-04T09:41:16Z

Is there anything more I should be doing with either of these patches? I think I've incorporated all review comments I've seen. Thanks!

python-dev · 2016-02-07T22:05:33Z

New changeset 7fea2cebc604 by Serhiy Storchaka in branch 'default':
Issue bpo-26039: Added zipfile.ZipInfo.from_file() and zipinfo.ZipInfo.is_dir().
https://hg.python.org/cpython/rev/7fea2cebc604

serhiy-storchaka · 2016-02-07T22:14:13Z

Committed zipinfo-from-file5.patch. Now I'm starting to review zipfile-open-w4.patch (I concurred with most Martin's comments for previous patches).

takluyver · 2016-02-08T09:55:30Z

Thanks Serhiy! I'll keep an eye out for comments on the other patch.

takluyver · 2016-02-15T10:37:04Z

Hi Serhiy, any more comments on the zf.open() patch?

takluyver · 2016-02-23T14:01:05Z

Ping! ;-)

DhirajMishra · 2016-02-23T14:16:32Z

Please ha Look on bpo-11980

http://bugs.python.org/issue11980
Already have been Patched

vadmium · 2016-02-23T21:52:13Z

Thanks for the pointer Dhiraj. I prefer the open(mode="w") version proposed here, as being more flexible. This way you could wrap the writer object in e.g. TextIOWrapper. The other patch requires passing in a file reader object.

Having another look at zipfile-open-w4.patch, I have some thoughts about locking and the writing-while-reading restriction:

The lock seems to be designed to serialize reads and writes (which operate on the common underlying file object). See revision 4973ccd46e32, and <https://bugs.python.org/issue14099#msg234142\>, although it would be good to document this, or at the minimum add a comment explaining the purpose and scope of the lock.

Currently, it appears that write() and writestr() acquire the lock, so I presume it is intended that these methods can be called multiple times concurrently, and also while the zip file is being read. With the patch, writestr() still preserves the lock usage, but write() does not because it is now implemented in terms of the new open(mode="w") method.

I think it would be good to clarify that the lock does _not_ protect concurrent writes via open(mode="w").

takluyver · 2016-02-24T14:50:57Z

My initial patch would have allowed passing a readable file-like object into zipfile. I was persuaded that allowing ZipFile.open() to return a writable object was a more intuitive and flexible API.

Concurrent writes with zf.open(mode='w') should be impossible, because it only allows one open handle at a time. It still uses the lock inside _ZipWriteFile, so concurrent writes to a single handle should be serialised.

I would not recommend anyone try to do concurrent access to a single ZipFile object from multiple threads or coroutines. It's quite stateful, there is no mention of concurrency in the docs, and no tests I can see that try concurrent access. The only thing that might be safe is reading from two files inside the zip file (which shouldn't be changed by this), but I wouldn't want to guarantee even that.

takluyver · 2016-02-24T14:57:00Z

Oh, I see test_interleaved now, which does test overlapping reads of two files from the same zip file.

Do you want that clarified in the docs - which don't currently mention the lock at all - or in a comment in the module?

vadmium · 2016-02-25T12:12:01Z

When you say concurrent writes should be impossible, I guess that only applies to a single-threaded program. There is no lock protecting the “self._fileRefCnt > 1” check and related manipulation (not that I am saying there should be).

For serializing concurrent writes to a single handle, if that is intended it should be documented. If it is not intended, maybe it should be removed (my preference)?

It would be good to wait if Serhiy can explain the purpose of the lock, seeing he was involved in adding it and probably knows a lot more about this module than I.

takluyver · 2016-03-03T12:04:24Z

Serhiy, have you had a chance to look at what the zf.open(mode='w') patch does with the lock?

serhiy-storchaka · 2016-03-03T15:55:54Z

Sorry for the delay Thomas. This is complex and important to me issue and I want to be attentive to it.

I think we should preserve long existing behavior even if it is not documented (documenting or deprecating it is other issue). Concurrent reading and wring with concurrent reading always (at least since adding ZipFile.open()) worked, but was never documented nor tested. Concurrent writing was added rather as a side effect of bpo-14099. If there is a benefit from getting rid of it, we can break it.

For preserving current behavior ZipFile.open(mode='w') should acquire the lock and it should be released in _ZipWriteFile.close().

I have added other comments on Rietveld. The patch needs to be updated to resolve conflicts with committed zipinfo-from-file5.patch.

socketpair · 2016-03-28T05:38:10Z

https://github.com/SpiderOak/ZipStream tries to implement streaming in one more kind.

Also libarchive have experimental support of streaming write to zip archives in newest version.

So problem is actual.

takluyver · 2016-04-06T00:16:38Z

Sorry for the delay, this fell off my radar because emails from both the bug tracker and Rietveld tend to fall foul of my spam filters, so I have to go and check.

I have implemented Serhiy's suggestions, but there turns out to be a test (test_write_after_read) that calls writestr while a reading handle is open, and it now fails. This is absolutely expected according to the design we discussed, but if there's a test, I guess that means we need to keep that functionality working?

In principle, the only thing that's not possible is interleaving writes to multiple files within the zip - because we don't know where to start writing the second file. We should be able to have one writer and n readers going at once, but every time I start looking into that I get mired in complexity. Maybe (hopefully) there's some critical abstraction that hasn't occurred to me.

takluyver · 2016-04-23T16:10:08Z

Martin, Serhiy: Am I right that it would be unacceptable to change the test, and I therefore need to find a way to allow writestr while a reading-mode handle is open?

serhiy-storchaka · 2016-04-23T20:59:39Z

test_write_after_read was added in bpo-14099. It doesn't test intended behavior, it tests existing behavior just in case to not change it unintentionally. It is desirable to preserve it, but if there is no simple way, may be we can change it.

takluyver · 2016-04-25T11:20:31Z

It wasn't as bad as I thought (hopefully that doesn't mean I've missed something). zipfile-open-w6.patch allows interleaved reads and writes so long as the underlying file object is seekable.

You still can't have multiple write handles open (it wouldn't know where to start the second file), or read & write handles on a non-seekable stream (I don't think a non-seekable read-write file ever makes sense, except perhaps for a pty, which is really two separate streams on one file descriptor).

Tests are passing again.

vadmium · 2016-04-26T02:29:45Z

I understand Python’s zipfile module does not allow reading unless seek() is supported (this was one of Марк’s complaints). So it is irrelevant whether we are writing.

I left a few comments, but it sounds like it is valid read and write at the same time.

takluyver · 2016-04-26T09:30:14Z

zipfile-open-w7 adds a test that Martin requested

takluyver · 2016-04-26T11:01:08Z

zipfile-open-w8 removes the concurrent read-write check for non-seekable files. As Martin points out, reading currently requires seeking, and a streaming read API would look very different from the current API.

takluyver · 2016-05-03T13:05:56Z

Any further review comments on this?

takluyver · 2016-05-10T10:12:26Z

Ping? Once this is landed, I intend to make a backport package of it so I can start using it before Python 3.6 comes out.

serhiy-storchaka · 2016-05-11T17:14:05Z

I have added comments for tests and documentation. The implementation looks correct. But I prefer simpler solution on this stage. Here is something between zipfile-open-w5.patch and zipfile-open-w8.patch, close to zipfile-open-w5.patch with fixed issues.

Writer doesn't use look and seeking.
Writing while there are open read handlers is permitted.
Reading while there is open write handler is forbidden.

This works in all cases when current code works. The ability to read while there is open write handler is separate new feature (and I don't know whether there is a need in it).

takluyver · 2016-05-12T09:57:11Z

Thanks Serhiy. I'm quite happy with that set of constraints. I had a look over your patch set, and just spotted one thing in writestr that I have overlooked all along.

serhiy-storchaka · 2016-05-12T10:12:43Z

Thank you for your contribution Thomas. I'll push the patch with fixed writestr in short time unless Martin left comments.

vadmium · 2016-05-13T09:58:04Z

Yes this looks okay to me, so go ahead.

My only concern was other methods affecting the file position while in the middle of writing, but it looks like those methods maintain separate objects like self.filelist, so there is no conflict.

python-dev · 2016-05-13T10:53:32Z

New changeset 175c3f1d0256 by Serhiy Storchaka in branch 'default':
Issue bpo-26039: zipfile.ZipFile.open() can now be used to write data into a ZIP
https://hg.python.org/cpython/rev/175c3f1d0256

takluyver · 2016-05-14T16:01:36Z

Thanks Serhiy! I am marking this as fixed.

serhiy-storchaka · 2016-05-14T19:21:33Z

Let me know when you create a backport package Thomas. I'd like to help.

serhiy-storchaka · 2016-05-14T20:59:48Z

There are yet few issues.

Currently RuntimeError is widely used in zipfile. But in most cases ValueError would be more appropriate since it programming error (for example ValueError is raised when try to read or write to closed file). See bpo-24693, but this is backward incompatible change. For new exceptions we are free to use any exception type without breaking backward compatibility. It looks to me, that ValueError is more appropriate is these cases than RuntimeError. What do you think about this Thomas? Here is a patch that changes exceptions types.
ZipInfo.is_dir() is not documented and lacks a docstring.
It would be better to make force_zip64 a keyword-only parameter. This would allow to add new positional parameters without breaking compatibility.

takluyver · 2016-05-14T21:25:06Z

For the cases like read/write handles conflicting, ValueError doesn't seem quite right to me - I take ValueError to mean that you have passed an invalid value to a function. RuntimeError feels like the closest fit for 'the current state does not allow this operation'. It is rather broad, though.

If I was writing it anew, I would use ValueError for something like an invalid mode parameter: zf.open('foo', mode='q') . That particular case was already in the code, though, so I think backwards compatibility should take precedence.

2 & 3. Agreed, I'll prepare a patch

takluyver · 2016-05-14T22:02:22Z

Patch attached for points 2 and 3

vadmium · 2016-05-14T23:29:06Z

The bonus patch looks okay, although I wonder if the directory slash (/) information should be in the RST rather than doc string. Usually the RST has all the details, and doc strings are just summaries.

Regarding exceptions, I can sympathise with both sides of the argument and don’t have a strong opinion (why does the exception type matter for programmer errors anyway?). But I think it might be better to be locally consistent within the zipfile module, and the module is already heavily documented with RuntimeError for similar programmer errors.

serhiy-storchaka · 2016-05-15T05:25:10Z

Agreed that the docstring should be shorten version of the documentation and not vice versa. The directory slash information is implementation detail. I'm not sure it is needed in the documentation.

takluyver · 2016-05-15T08:05:23Z

Updated version of the patch with the detail moved to rst.

python-dev · 2016-05-15T09:27:49Z

New changeset 29b470b8ab77 by Serhiy Storchaka in branch 'default':
Issue bpo-26039: Document ZipInfo.is_dir() and make force_zip64 keyword-only.
https://hg.python.org/cpython/rev/29b470b8ab77

serhiy-storchaka · 2016-05-15T09:28:40Z

LGTM. Thank you Thomas.

takluyver · 2016-05-15T17:27:06Z

The backported package now exists on PyPI as zipfile36. It's passing the tests on 3.4 - I haven't tested further back. I'm trying out Gitlab for hosting it:
https://gitlab.com/takluyver/zipfile36

takluyver mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Jan 7, 2016

serhiy-storchaka self-assigned this Jan 26, 2016

takluyver mannequin closed this as completed May 14, 2016

serhiy-storchaka reopened this May 14, 2016

serhiy-storchaka closed this as completed May 15, 2016

ezio-melotti transferred this issue from another repository Apr 10, 2022

More flexibility in zipfile write interface #70227

More flexibility in zipfile write interface #70227

Comments

takluyver mannequin commented Jan 7, 2016

takluyver mannequin commented Jan 7, 2016

Carreau mannequin commented Jan 7, 2016

serhiy-storchaka commented Jan 7, 2016

takluyver mannequin commented Jan 15, 2016

takluyver mannequin commented Jan 17, 2016

takluyver mannequin commented Jan 26, 2016

serhiy-storchaka commented Jan 26, 2016

takluyver mannequin commented Jan 26, 2016

takluyver mannequin commented Jan 27, 2016

vadmium commented Jan 29, 2016

takluyver mannequin commented Jan 29, 2016

takluyver mannequin commented Jan 29, 2016

takluyver mannequin commented Jan 30, 2016

takluyver mannequin commented Jan 30, 2016

takluyver mannequin commented Feb 4, 2016

python-dev mannequin commented Feb 7, 2016

serhiy-storchaka commented Feb 7, 2016

takluyver mannequin commented Feb 8, 2016

takluyver mannequin commented Feb 15, 2016

takluyver mannequin commented Feb 23, 2016

DhirajMishra mannequin commented Feb 23, 2016

vadmium commented Feb 23, 2016

takluyver mannequin commented Feb 24, 2016

takluyver mannequin commented Feb 24, 2016

vadmium commented Feb 25, 2016

takluyver mannequin commented Mar 3, 2016

serhiy-storchaka commented Mar 3, 2016

socketpair mannequin commented Mar 28, 2016

takluyver mannequin commented Apr 6, 2016

takluyver mannequin commented Apr 23, 2016

serhiy-storchaka commented Apr 23, 2016

takluyver mannequin commented Apr 25, 2016

vadmium commented Apr 26, 2016

takluyver mannequin commented Apr 26, 2016

takluyver mannequin commented Apr 26, 2016

takluyver mannequin commented May 3, 2016

takluyver mannequin commented May 10, 2016

serhiy-storchaka commented May 11, 2016

takluyver mannequin commented May 12, 2016

serhiy-storchaka commented May 12, 2016

vadmium commented May 13, 2016

python-dev mannequin commented May 13, 2016

takluyver mannequin commented May 14, 2016

serhiy-storchaka commented May 14, 2016

serhiy-storchaka commented May 14, 2016

takluyver mannequin commented May 14, 2016

takluyver mannequin commented May 14, 2016

vadmium commented May 14, 2016

serhiy-storchaka commented May 15, 2016

takluyver mannequin commented May 15, 2016

python-dev mannequin commented May 15, 2016

serhiy-storchaka commented May 15, 2016

takluyver mannequin commented May 15, 2016