classification
Title: Improve pickling efficiency of itertools.cycle
Type: resource usage Stage: resolved
Components: Extension Modules Versions: Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: alexandre.vassalotti, pitrou, python-dev, rhettinger, serhiy.storchaka
Priority: low Keywords: patch

Created on 2015-08-15 21:23 by rhettinger, last changed 2015-08-16 21:52 by rhettinger. This issue is now closed.

Files
File name Uploaded Description Edit
time_cycle.py rhettinger, 2015-08-15 22:30 Simple timing suite for cycle()
cycle5_brokensetstate.diff rhettinger, 2015-08-15 23:28 Partial patch -- still needs work on __setstate__ review
cycle9.diff rhettinger, 2015-08-16 05:05 More complete patch that passes all tests review
cycle_reduce_2.patch serhiy.storchaka, 2015-08-16 08:44 Simpler, faster and more memory efficient pickling review
cycle_reduce_3.patch serhiy.storchaka, 2015-08-16 08:44 Faster unpickled cycle object review
Messages (8)
msg248662 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-08-15 21:32
When a cycle object has fully consumed its input iterable, __reduce__ method uses the returns a space-inefficient result when space-efficient alternative is available.

# Current way of restoring a cycle object with excess info in setstate:
>>> c = cycle(iter('de'))
>>> c.__setstate__((['a', 'b', 'c', 'd', 'e'], 1))
>>> ''.join(next(c) for i in range(20)) # next 20 values
'deabcdeabcdeabcdeabc'

# The same result can be achieved with less information: 
>>> c = cycle(iter('de'))
>>> c.__setstate__((['a', 'b', 'c'], 0))
>>> ''.join(next(c) for i in range(20)) # next 20 values
'deabcdeabcdeabcdeabc'
msg248664 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-08-15 21:54
Also, looking at the source for itertools.cycle(), it looks like the overall speed could be boosted considerably by looping over the saved list directly rather than allocating a new list iterator every time the cycle loops around.
msg248669 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-08-15 23:28
Attaching a partial patch:
* More than doubles the speed of cycle()
* Cuts size of __reduce__ result by about a third (on average)
* Still needs work on __setstate__ for a correct restore.
msg248674 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-08-16 04:57
Current cycle implementation is simple and clever, but can be optimized. The part about iterating LGTM (but looks the firstpass field can be eliminated at all). But __reduce__ doesn't look so optimal. It makes a copy of a list and makes iterating an unpickled cycle object slow. It would be more optimal if create new list with rotated content or even rotate original list inplace.
msg248675 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-08-16 05:52
Added an updated patch that passes all tests.
msg248677 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-08-16 08:44
Original Raymonds reason in msg248662 is not valid. Pickling a cycle object that fully consumed its input iterable is already space-inefficient.

>>> import itertools, pickle, pickletools
>>> c = itertools.cycle(iter('abcde'))
>>> [next(c) for i in range(8)]
['a', 'b', 'c', 'd', 'e', 'a', 'b', 'c']
>>> pickle.dumps(c)
b'\x80\x03citertools\ncycle\nq\x00cbuiltins\niter\nq\x01]q\x02(X\x01\x00\x00\x00aq\x03X\x01\x00\x00\x00bq\x04X\x01\x00\x00\x00cq\x05X\x01\x00\x00\x00dq\x06X\x01\x00\x00\x00eq\x07e\x85q\x08Rq\tK\x03b\x85q\nRq\x0bh\x02K\x01\x86q\x0cb.'
>>> pickletools.dis(pickle.dumps(c))
    0: \x80 PROTO      3
    2: c    GLOBAL     'itertools cycle'
   19: q    BINPUT     0
   21: c    GLOBAL     'builtins iter'
   36: q    BINPUT     1
   38: ]    EMPTY_LIST
   39: q    BINPUT     2
   41: (    MARK
   42: X        BINUNICODE 'a'
   48: q        BINPUT     3
   50: X        BINUNICODE 'b'
   56: q        BINPUT     4
   58: X        BINUNICODE 'c'
   64: q        BINPUT     5
   66: X        BINUNICODE 'd'
   72: q        BINPUT     6
   74: X        BINUNICODE 'e'
   80: q        BINPUT     7
   82: e        APPENDS    (MARK at 41)
   83: \x85 TUPLE1
   84: q    BINPUT     8
   86: R    REDUCE
   87: q    BINPUT     9
   89: K    BININT1    3
   91: b    BUILD
   92: \x85 TUPLE1
   93: q    BINPUT     10
   95: R    REDUCE
   96: q    BINPUT     11
   98: h    BINGET     2
  100: K    BININT1    1
  102: \x86 TUPLE2
  103: q    BINPUT     12
  105: b    BUILD
  106: .    STOP
highest protocol among opcodes = 2

An internal iterator is not pickled as iter("de"), but as an iterator of the list ["a", "b", "c", "d", "e"] with 3 items consumed. This list also saved as a part of a cycle object state, but not as a copy, but as a reference.

There are two alternative patches. Both keep Raymonds optimization of cycle iterating, but have advantages. cycle_reduce_2.patch makes __reduce__ faster and more memory efficient than Raymonds variant. cycle_reduce_3.patch makes unpickled cycle object so optimized as original.
msg248692 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-08-16 21:49
New changeset 17b5c8ba6875 by Raymond Hettinger in branch 'default':
Issue #24874: Speed-up itertools and make it pickles more compact.
https://hg.python.org/cpython/rev/17b5c8ba6875
msg248693 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-08-16 21:51
Applied the cycle2 patch but kept the signature the same as the original reduce (using a number instead of a boolean).
History
Date User Action Args
2015-08-16 21:52:00rhettingersetstatus: open -> closed
resolution: fixed
messages: + msg248693

stage: patch review -> resolved
2015-08-16 21:49:42python-devsetnosy: + python-dev
messages: + msg248692
2015-08-16 15:45:20rhettingersetpriority: normal -> low
2015-08-16 08:44:41serhiy.storchakasetfiles: + cycle_reduce_3.patch
2015-08-16 08:44:29serhiy.storchakasetfiles: + cycle_reduce_2.patch
2015-08-16 08:44:10serhiy.storchakasetmessages: + msg248677
2015-08-16 05:52:39rhettingersetmessages: + msg248675
2015-08-16 05:05:26rhettingersetfiles: + cycle9.diff
2015-08-16 04:57:52serhiy.storchakasetmessages: + msg248674
2015-08-16 04:03:17serhiy.storchakasetnosy: + pitrou, alexandre.vassalotti, serhiy.storchaka

stage: patch review
2015-08-15 23:28:54rhettingersetfiles: + cycle5_brokensetstate.diff
keywords: + patch
messages: + msg248669
2015-08-15 22:30:16rhettingersetfiles: + time_cycle.py
2015-08-15 21:54:15rhettingersetmessages: + msg248664
2015-08-15 21:32:35rhettingersetmessages: - msg248657
2015-08-15 21:32:26rhettingersetmessages: + msg248662
2015-08-15 21:23:07rhettingercreate