classification
Title: Make iterators pickleable
Type: enhancement Stage:
Components: Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: georg.brandl, jcea, kristjan.jonsson, loewis, michael.foord, pitrou, python-dev, rhettinger, sbt
Priority: normal Keywords: patch

Created on 2012-03-13 16:10 by kristjan.jonsson, last changed 2012-04-08 17:07 by kristjan.jonsson. This issue is now closed.

Files
File name Uploaded Description Edit
pickling.patch kristjan.jonsson, 2012-03-13 16:10 review
pickling2.patch kristjan.jonsson, 2012-03-27 14:28 review
Messages (25)
msg155626 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2012-03-13 16:10
A common theme in many talks last year about cloud computing was the need to suspend execution, pickle state, and resume it on a different node.  This patch is the result of last year's stackless sprint at pycon, finally completed and submitted for review.

Python does not currently support pickling of many run-time structures, but pickling for things like iterators is trivial.
A large piece of Stackless' branch is to make sure that various run-time constructs are pickleable, including function objects.  While this patch does not do that, it does add pickling for dictiter, and the lot.
This makes it possible to have compilcated data sets, iterate through them, and pickle them in a semi-consumed state.

Please note that a slight hack is needed to pickle some iterators.  Many of these classes are namely "hidden" and there is no way to access their constructors by name.  instead, an unpickling trick is to invoke "iter" on an object of the target type instead.  Not the most elegant solution but I didn't want to complicate matters by adding iterator classes into namespaces.  Where should stringiter live for example?  Be a builtin like str?

We also didn't aim to make all iterators copy.copy-able using the __reduce__ protocol.  Some iterators actually use internal iterators themselves, and if a (non-deep) copy were to happen, we would have to shallow copy those internal objects.  Instead, we just return the internal iterator object directly from __reduce__ and allow recursive pickling to proceed.
msg155669 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2012-03-13 21:24
ISTM that a discussion on python-dev would be of value here.  Iterators are a protocol, not a class; hence, the effort to make them all picklable is potentially endless.  The effort would also always be incomplete because some iterators are difficult or inconvenient to pickle (esp. those with inputs from local or temporary resources).

Experience with SQL hasn't shown a need to save partially consumed cursors and my experience with iterators indicates that the need for pickling would be rare or that is would distract from better solutions (perhaps message based or somesuch).

The size of this patch is a hint that the idea is not a minor change and that it would add a maintenance burden.  What is far from clear is whether it would have value for any real-world problems.
msg155683 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2012-03-13 22:27
Sure, I'll start a discussion there, but at least I've gotten the patch in.  The patch is smaller than it looks, most of it is tests.
msg155685 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2012-03-13 22:38
The patch looks fine.  If python-dev thinks that making-iterators-picklable is worth doing, I would support this going into Python 3.3 (no extra benefit will come from waiting).
msg155704 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2012-03-14 01:19
Btw, is there some way I can make this patch easier to review? I haven't contributed much since the Hg switchover, can I make it so that people can do visual diff?
msg155735 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2012-03-14 07:27
The "review" link next to the the patch file entry should already work and provide a nice visual diff + commenting interface.
msg155775 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-03-14 17:25
I think

    PyAPI_FUNC(PyObject *) _PyIter_GetIter(const char *iter);

has a confusing name for a convenience function which retrieves an attribute from the builtin module by name.

Not sure what would be better.  Maybe _PyIter_GetBuiltin().
msg155776 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2012-03-14 17:30
another trick has been suggested.  For hidden iterator objects, such as stringiter, to actually put them in the "types" module.
in there, we could do something like:
#types.py
stringiter = iter('').__class__

and we would then change the name of the iterator in c to be "types.stringiter".

How does that sound?  It _does_ make it necessary for the types module to be there to help with pickling.  The _proper_ fix would be for e.g. stringiter to live in builtins, next to 'str' that it is iterating over.  Any thoughts?
msg156688 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2012-03-24 10:19
sbt, I will fix the api name.  Any other objections then?  Leave it as it is with the "iter()" trick?
msg156720 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2012-03-24 22:40
Has python-dev discussion been launched?  It is far from clear that this is worth doing.  Pickling runtime structures may be a normal use case for Stackless but isn't a normal use case for regular Python.  

Also, it seems pointless to start down this path because it will always be incomplete (i.e. pickling running generators, or socket streams, etc).

It also seems to be at odds with the normal use case for passing around partially consumed iterators -- the laziness and memory friendliness is a desired feature; however, the patch pickles some iterators (such as dicts) by running them to completion and storing *all* of the results that would have been returned.
msg156722 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-03-24 23:51
I think the "worth doing" argument doesn't really hold, given that it's done. The question at hand really is
a) is the patch correct?
b) can we commit to maintaining it, even as things around it may change?

I'm not bothered with the patch being potentially incomplete: anybody wishing to pickle more things should contribute patches for that.

As for a), I think we should give people some time to review, and then wait for beta releases to discover issues. If this is a rarely-used feature, the world won't end if it has bugs.

As for b), I think the main issue is forward compatibility: will pickles created by 3.3 still be readable by future versions?
msg156723 - (view) Author: Michael Foord (michael.foord) * (Python committer) Date: 2012-03-24 23:55
Yes there was a discussion on python-dev. Various people spoke in favour, no-one against:

http://mail.python.org/pipermail/python-dev/2012-March/117566.html
msg156731 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2012-03-25 02:38
Michael, thanks for the link.  The email was clearer about its rationale than was listed here.

When this patch gets applied, any discussion of it in the docs should be clear that generators aren't included and that pickling things like dict iterators entail running the iterator to completion and storing all of the results in a list.
msg156749 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-03-25 11:40
> ... and that pickling things like dict iterators entail running the 
> iterator to completion and storing all of the results in a list.

The thing to emphasise here is that pickling an iterator is "destructive": afterwards the original iterator will be "empty".

I can't think of any other examples where pickling an object causes non-trivial mutation of that object.

Come to think of it, doesn't copy.copy() delegate to __reduce__()/__reduce_ex__().  It would be a bit surprising if copy.copy(myiterator) were to consume myiterator.  I expect copy.copy() to return an independent copy without mutating the original object.
msg156751 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-03-25 12:02
> The thing to emphasise here is that pickling an iterator is
> "destructive": afterwards the original iterator will be "empty".

If you look at the patch it isn't (or shouldn't be).

I agree with Raymond that accumulating dict and set iterators in a list is a bit weird. That said, with hash randomization, perhaps we can't do any better (the order of elements in the internal table depends on the process-wide hash seed).
msg156756 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-03-25 12:38
> If you look at the patch it isn't (or shouldn't be).

Sorry.  I misunderstood when Raymond said "running the iterator to completion".
msg156902 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2012-03-27 10:57
Okay, I'll go ahead, fix the 'iter()' trick api name and apply the patch.  Then we'll see what happens :).
Any suggestion towards what documentation changes are needed?  I don't think the list of pickleable objects is made explicit anywhere, it is largely a trial and error thing.  But obviously Misc/News is the prime candidate.
Thanks
msg156903 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-03-27 11:03
> Okay, I'll go ahead, fix the 'iter()' trick api name and apply the
> patch.  Then we'll see what happens :).

Please wait for reviews.
msg156904 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2012-03-27 11:13
Raymond had already reviewed it, and sbt.  I wasn't aware of any more pending reviews, but I'll wait for yours, of course.
msg156906 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-03-27 11:15
Well, please take a look at the review link. There are already some comments there.
msg156915 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2012-03-27 12:09
Btw, regarding compatibility:  The docs say " The pickle serialization format is guaranteed to be backwards compatible across Python releases."

I take this to mean the serialization format itself.  I don't think there is a broader guarantee that pickles generated by one version can be read by another, since objects can change their internal representation, types can even disappear or change.

In that sense, pickles made by 2.7 can be read by 3.2, in the sense that they will correctly return an error when they can't construct a 'str' object.

If I misunderstand things, then at least I think that the pickle documentation should be made clearer, that not only is the protocol supposed to be read, but also that entire pickles should always be readable _and_ instantiatable, by future versions
msg156928 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2012-03-27 14:28
I've incorporated antoine's comments and the proposed internal function name into a new patch.

A lot of the changes concerned thecking the type() of the unpickled iterator.  Now, it wasn't a specific design goal to get the exact same objects back, only _equivalent_ objects.  In particular, dicts and sets have problems, so a dictiter to a partially consumed dict cannot be pickled as it is.

so, I've added type cases everywhere, but for those cases.
Now, how important do you think type consistency is?  when using iterators, does one ever look at it and test its type?  if this is important, I _could_ take another look at dicts and seta and create fresh iterators to the dicts and sets made out of the remainder of the items, rather than iterators to lists.

Any thoughs?
msg156933 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-03-27 15:10
> Now, how important do you think type consistency is?  when using
> iterators, does one ever look at it and test its type?  if this is
> important, I _could_ take another look at dicts and seta and create
> fresh iterators to the dicts and sets made out of the remainder of the
> items, rather than iterators to lists.

I think type consistency is important if it can be achieved reasonably
simply.
In the dict and set case, I'm not sure you can recreate the internal
table in the same order (even accross interpreter restarts). In this
case, you should just check that the unpickled data is a subclass of
collections.abc.Iterator.
msg157405 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2012-04-03 09:56
Good idea Antoine.
So, I'll with your suggested fix to the unittests I'll commit this and then look on while Rome burns.
msg157408 - (view) Author: Roundup Robot (python-dev) Date: 2012-04-03 11:12
New changeset 4ff234337e24 by Kristján Valur Jónsson in branch 'default':
Issue #14288: Serialization support for builtin iterators.
http://hg.python.org/cpython/rev/4ff234337e24

New changeset 51c88d51aa4a by Kristján Valur Jónsson in branch 'default':
Issue #14288: Modify Misc/NEWS
http://hg.python.org/cpython/rev/51c88d51aa4a
History
Date User Action Args
2012-04-08 17:07:43kristjan.jonssonsetstatus: open -> closed
resolution: fixed
2012-04-03 11:12:04python-devsetnosy: + python-dev
messages: + msg157408
2012-04-03 09:56:32kristjan.jonssonsetmessages: + msg157405
2012-03-27 15:10:06pitrousetmessages: + msg156933
2012-03-27 14:29:08kristjan.jonssonsetfiles: + pickling2.patch

messages: + msg156928
2012-03-27 12:09:01kristjan.jonssonsetmessages: + msg156915
2012-03-27 11:15:29pitrousetmessages: + msg156906
2012-03-27 11:13:29kristjan.jonssonsetmessages: + msg156904
2012-03-27 11:03:41pitrousetmessages: + msg156903
2012-03-27 10:57:31kristjan.jonssonsetmessages: + msg156902
2012-03-25 21:31:46rhettingersetassignee: rhettinger ->
2012-03-25 12:38:21sbtsetmessages: + msg156756
2012-03-25 12:02:37pitrousetmessages: + msg156751
2012-03-25 11:40:22sbtsetmessages: + msg156749
2012-03-25 02:38:28rhettingersetmessages: + msg156731
2012-03-24 23:55:28michael.foordsetmessages: + msg156723
2012-03-24 23:51:39loewissetmessages: + msg156722
2012-03-24 22:40:37rhettingersetmessages: + msg156720
2012-03-24 13:53:49pitrousetnosy: + pitrou
2012-03-24 10:19:21kristjan.jonssonsetmessages: + msg156688
versions: + Python 3.3, - Python 3.4
2012-03-14 19:37:07rhettingersetassignee: rhettinger
2012-03-14 17:30:59kristjan.jonssonsetmessages: + msg155776
2012-03-14 17:25:37sbtsetnosy: + sbt
messages: + msg155775
2012-03-14 16:17:10rhettingersetmessages: - msg155759
2012-03-14 15:44:39rhettingersetmessages: + msg155759
2012-03-14 07:27:41georg.brandlsetnosy: + georg.brandl
messages: + msg155735
2012-03-14 01:19:59kristjan.jonssonsetmessages: + msg155704
2012-03-13 22:38:05rhettingersetmessages: + msg155685
2012-03-13 22:27:29kristjan.jonssonsetmessages: + msg155683
2012-03-13 21:24:31rhettingersetnosy: + rhettinger
messages: + msg155669
2012-03-13 18:30:40jceasetnosy: + jcea
2012-03-13 16:10:19kristjan.jonssoncreate