classification
Title: Fix pickling efficiency of named tuples in 2.7.3
Type: behavior
Stage: needs patch
Components: Library (Lib)
Versions: Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: rhettinger
Nosy List: amaury.forgeotdarc, anselm.kruis, barry, benhoyt, benjamin.peterson, flox, georg.brandl, jcea, larry, pitrou, python-dev, rhettinger, serhiy.storchaka, thomie
Priority: normal
Keywords: patch

Created on 2012-08-02 13:44 by thomie, last changed 2015-05-23 02:32 by rhettinger. This issue is now closed.

Files
File name                       Uploaded                               Description
show_namedtuple_pickle_fix.py   thomie, 2012-08-15 10:31               Show namedtuple pickle fix
namedtuple_pickle_fix.patch     thomie, 2012-08-31 19:56
namedtuple-pickle.diff          amaury.forgeotdarc, 2012-09-05 23:25
unpickletest.py                 benhoyt, 2013-04-26 01:02              Test namedtuple unpickling memory usage issues
Messages (15)
msg167215 - (view) Author: Thomas Miedema (thomie) Date: 2012-08-02 13:44
Pickling a namedtuple Point(x=10, y=20, z=30) in Python 2.7.2 with protocol level 0 would result in something like the following output:

  ccopy_reg
  _reconstructor
  p0
  (c__main__
  Point
  p1
  c__builtin__
  tuple
  p2
  (I10
  I20
  I30
  tp3
  tp4
  Rp5
  .

In Python 2.7.3, the same namedtuple dumps to:

  ccopy_reg
  _reconstructor
  p0
  (c__main__
  Point
  p1
  c__builtin__
  tuple
  p2
  (I10
  I20
  I30
  tp3
  tp4
  Rp5
  ccollections
  OrderedDict
  p6
  ((lp7
  (lp8
  S'x'
  p9
  aI10
  aa(lp10
  S'y'
  p11
  aI20
  aa(lp12
  S'z'
  p13
  aI30
  aatp14
  Rp15
  b.

Note the OrderedDict at the end. All of the data, both the field names and the values, is duplicated, which can result in very large pickled files when using nested namedtuples.
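
One way to check whether a given interpreter attaches this extra state, without reading opcodes, is to look at what `__reduce_ex__` hands to pickle. The following is a minimal sketch; the helper name `extra_state` is made up for illustration. On an affected interpreter such as 2.7.3 the state would be expected to be the OrderedDict described above, while a fixed interpreter should report no state.

```python
import collections

Point = collections.namedtuple('Point', ['x', 'y', 'z'])

def extra_state(obj, proto):
    """Return the state pickle would store next to obj, or None (hypothetical helper)."""
    reduced = obj.__reduce_ex__(proto)
    # Protocols 0 and 1 may return a 2-tuple with no state slot at all;
    # otherwise the state sits at index 2 of the reduce tuple.
    return reduced[2] if len(reduced) > 2 else None

for proto in range(3):
    print(proto, extra_state(Point(10, 20, 30), proto))
```

A non-None state here means the tuple contents are being stored a second time.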

Loading both dumps with CPython 2.7.3 works, which is why this bug was not noticed earlier. Loading the second dump with CPython or PyPy 2.7.2 does not work, however: CPython 2.7.3 broke forward compatibility.

Attached is a patch with a fix. The patch makes pickled namedtuples forward compatible with 2.7.2. This patch does not break backward compatibility with 2.7.3, since the extra OrderedDict data contained the same information as the tuple.

Introduced:
http://hg.python.org/cpython/diff/26d5f022eb1a/Lib/collections.py

Also relevant:
http://bugs.python.org/issue3065
msg168273 - (view) Author: Thomas Miedema (thomie) Date: 2012-08-15 10:31
Attached is a script that shows the problem at hand.

Note that my remark that this bug could result in very large pickled files when using nested namedtuples seems not to be true.
msg169578 - (view) Author: Thomas Miedema (thomie) Date: 2012-08-31 19:56
Added a better testcase.
msg169597 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2012-09-01 00:29
FWIW, all pickle protocol levels are affected:

import collections, pickletools
from pickle import dumps

Point = collections.namedtuple('Point', ['x', 'y', 'z'])
for proto in range(3):
    pickletools.dis(dumps(Point(10, 20, 30), proto))

I'll look at the proposed fix in more detail when I get a chance -- we want to make sure that subclasses aren't adversely affected and that there aren't any other unintended side-effects.
msg169892 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2012-09-05 23:25
Adding "def __getstate__(self): return None" to the namedtuple template fixes the issue.  Here is a patch with test.
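
A rough sketch of why returning None from `__getstate__` helps. The classes `LeakyPoint` and `FixedPoint` are hypothetical stand-ins, not the real namedtuple template. `LeakyPoint` returns its `__dict__` from `__getstate__` explicitly to emulate pickle picking the attribute up by default; that explicit step is an assumption made so the sketch behaves the same on current Python versions, where the default lookup may differ.

```python
from collections import OrderedDict

class LeakyPoint(tuple):
    """Hypothetical stand-in for the 2.7.3 template: state mirrors the fields."""
    __slots__ = ()
    _fields = ('x', 'y', 'z')

    def __new__(cls, x, y, z):
        return tuple.__new__(cls, (x, y, z))

    def __getnewargs__(self):
        return tuple(self)

    @property
    def __dict__(self):
        return OrderedDict(zip(self._fields, self))

    def __getstate__(self):
        # Emulates the regression: the field mapping becomes pickle state.
        return self.__dict__

class FixedPoint(LeakyPoint):
    __slots__ = ()

    def __getstate__(self):
        # The proposed fix: report no state, so only the tuple is pickled.
        return None

# Index 2 of the protocol 2 reduce tuple is the state pickle would store.
leaky = LeakyPoint(10, 20, 30).__reduce_ex__(2)
fixed = FixedPoint(10, 20, 30).__reduce_ex__(2)
print(leaky[2])  # duplicated field names and values
print(fixed[2])  # no state
```

Note that `FixedPoint` still exposes `__dict__`; only its inclusion in the pickle is suppressed.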
msg187825 - (view) Author: Ben Hoyt (benhoyt) * Date: 2013-04-26 01:02
I just hit this issue in a big way -- would have been nice for this fix to go into Python 2.7.4. :-)

It was quite hard to track down (as in, a day or two of debugging :-) because the symptoms didn't point directly to namedtuple. In our setup we pickle/unpickle some big files, and the symptoms we noticed were extremely high memory usage after *un*pickling -- as in, 3x what we were getting before upgrading from Python 2.6. We first tracked it down to unpickling, and then from there narrowed it down to namedtuple.

The first "fix" I discovered was that I could use pickletools.optimize() to reduce the memory-usage-after-unpickling back down to sensible levels. I don't know enough about pickle to know exactly why this is -- perhaps fragmentation due to extra unpickling data structures allocated on the heap, that optimize() removes?
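
For reference, a minimal sketch of the pickletools.optimize() step described above. The sample data is made up and uses plain tuples rather than namedtuples; optimize() drops memo PUT opcodes that nothing refers back to, which is why the optimized pickle comes out smaller.

```python
import pickle
import pickletools

# Made-up sample data standing in for a real workload.
records = [("x", i, "y", i * 2) for i in range(1000)]

raw = pickle.dumps(records, protocol=2)
optimized = pickletools.optimize(raw)  # removes unused PUT opcodes

print(len(raw), len(optimized))

# The optimized pickle must still load to the same data.
assert pickle.loads(optimized) == records
```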

Here's the memory usage of our Python process after unpickling a ~9MB pickle file (protocol 2) which includes a lot of namedtuples. This is on Python 2.7.4 64-bit. With the original collections.py -- "normal" means un-optimized pickle, "optimized" means run through pickletools.optimize():

Memory usage after loading normal: 106664 KB
Memory usage after loading optimized: 31424 KB

With collections.py modified so namedtuple's templates include "def __getstate__(self): return None":

Memory usage after loading normal: 33676 KB
Memory usage after loading optimized: 26392 KB

So you can see the Python 2.7 version of namedtuple makes the process use basically 3x the RAM when unpickled (without pickletools.optimize). Note that Python 2.6 does *not* do this (it doesn't define __dict__ or use OrderedDict so doesn't have this issue). And for some reason Python 3.3(.1) doesn't have the issue either, even though that does define __dict__ and use OrderedDict. I guess Python 3.3 does pickling (or garbage collection?) somewhat differently.

You can verify this yourself using the attached unpickletest.py script. Note that I'm running on Windows 7, but I presume this would happen on Linux/OS X too, as this issue has nothing to do with the OS. The script should work on non-Windows OSes, but you have to type in the RAM usage figures manually (using "top" or similar).

Note that I'm doing a gc.collect() just before fetching the memory usage figure just in case there's uncollected cyclical garbage floating around, and I didn't want that to affect the measurement.
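
A different way to take a similar measurement, sketched under some assumptions: it uses tracemalloc, which exists only on Python 3.4 and later (so not on 2.7), and it counts Python-level allocations rather than the process memory figure the attached script reads, so it would not show heap fragmentation. The function name `bytes_retained_by_load` and the sample data are made up.

```python
import gc
import pickle
import tracemalloc

def bytes_retained_by_load(blob):
    """Unpickle blob and return (object, bytes still allocated afterwards)."""
    gc.collect()                      # start from a clean slate
    tracemalloc.start()
    obj = pickle.loads(blob)
    gc.collect()                      # drop any cyclic garbage before measuring
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return obj, current

# Made-up sample data standing in for a real pickle file.
blob = pickle.dumps([(i, str(i)) for i in range(10000)], protocol=2)
obj, retained = bytes_retained_by_load(blob)
print(retained)
```

Keeping a reference to the loaded object until after the measurement matters; otherwise the data would already be freed.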

I'm not sure I fully understand the cause (of where all this memory is going), or the fix for that matter. The OrderedDict is being pickled along with the namedtuple instance, because an OrderedDict is returned by __dict__, and pickle uses that. But is that staying in memory on unpickling? Why does optimizing the pickle fix the RAM usage issue to a large extent?

In any case, I've made the __getstate__ fix in our code, and that definitely fixes the RAM usage for us. (We're also going to be optimizing our pickles from now on.)
msg187862 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-04-26 14:59
I would like to call this a critical regression.
Under 2.7 and 3.2, all pickle protocols are affected.
Under 3.3 and 3.4, pickle protocols 0 and 1 are affected.

(unfortunately, 3.2 doesn't receive bugfixes anymore)
msg188294 - (view) Author: Roundup Robot (python-dev) Date: 2013-05-03 07:59
New changeset 18303391b981 by Raymond Hettinger in branch '2.7':
Issue #15535:  Fix regression in pickling of named tuples.
http://hg.python.org/cpython/rev/18303391b981
msg188296 - (view) Author: Roundup Robot (python-dev) Date: 2013-05-03 09:41
New changeset 65cd71abebc8 by Raymond Hettinger in branch '3.3':
Issue #15535:  Fix pickling of named tuples.
http://hg.python.org/cpython/rev/65cd71abebc8
msg188297 - (view) Author: Ben Hoyt (benhoyt) * Date: 2013-05-03 09:53
2.7 fix works for me, thanks! Just curious -- why the different fix for 3.3 (addition of __getstate__ instead of removal of __dict__)?
msg188299 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-05-03 10:07
> why the different fix for 3.3 

I reverted the 2.7.4 addition of __dict__ rather than introduce more differences between point releases with possible unintended effects.

In 3.3, the __dict__ attribute was there from the outset and was advertised in the docs, so it made more sense to leave it in and just suppress its inclusion in pickling.
msg188999 - (view) Author: Roundup Robot (python-dev) Date: 2013-05-12 10:32
New changeset 31eaf8a137ea by Georg Brandl in branch '3.2':
Issue #15535: Fix pickling of named tuples.
http://hg.python.org/cpython/rev/31eaf8a137ea
msg189610 - (view) Author: Anselm Kruis (anselm.kruis) * Date: 2013-05-19 17:51
>> why the different fix for 3.3 
>
> I reverted the 2.7.4 addition of __dict__ rather than introduce more
> differences between point releases with possible unintended effects.

__dict__ was a 2.7.3 addition (changeset 26d5f022eb1a). Now unpickling of named tuples created by 2.7.3 and 2.7.4 fails.
msg235495 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-06 20:31
What is left to do with this issue?
msg243882 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-05-23 02:32
> What is left to do with this issue?

Nothing that I can see.
History
Date User Action Args
2015-05-23 02:32:06  rhettinger  set  status: open -> closed; resolution: fixed; messages: + msg243882
2015-02-06 20:31:18  serhiy.storchaka  set  nosy: + serhiy.storchaka; messages: + msg235495
2013-05-19 22:22:03  rhettinger  set  priority: release blocker -> normal; status: closed -> open; resolution: fixed -> (no value); versions: - Python 3.2, Python 3.3, Python 3.4
2013-05-19 17:51:40  anselm.kruis  set  nosy: + anselm.kruis; messages: + msg189610
2013-05-12 10:32:39  python-dev  set  messages: + msg188999
2013-05-06 19:26:51  barry  set  nosy: + barry
2013-05-03 10:07:01  rhettinger  set  messages: + msg188299
2013-05-03 09:53:57  benhoyt  set  messages: + msg188297
2013-05-03 09:43:05  rhettinger  set  status: open -> closed; assignee: amaury.forgeotdarc -> rhettinger; resolution: fixed
2013-05-03 09:42:40  rhettinger  set  messages: - msg188064
2013-05-03 09:41:58  python-dev  set  messages: + msg188296
2013-05-03 07:59:33  python-dev  set  nosy: + python-dev; messages: + msg188294
2013-04-29 11:38:30  flox  set  nosy: + flox
2013-04-29 11:02:02  rhettinger  set  messages: - msg188063
2013-04-29 11:01:35  rhettinger  set  assignee: rhettinger -> amaury.forgeotdarc; messages: + msg188064
2013-04-29 10:50:54  rhettinger  set  messages: + msg188063
2013-04-28 19:17:01  georg.brandl  set  versions: + Python 3.2
2013-04-26 15:41:33  pitrou  set  nosy: + larry, georg.brandl; stage: needs patch; versions: + Python 3.3, Python 3.4
2013-04-26 14:59:31  pitrou  set  priority: high -> release blocker; nosy: + pitrou, benjamin.peterson; messages: + msg187862
2013-04-26 05:52:33  rhettinger  set  priority: normal -> high
2013-04-26 01:02:19  benhoyt  set  files: + unpickletest.py; nosy: + benhoyt; messages: + msg187825
2012-10-03 12:53:44  jcea  set  nosy: + jcea
2012-09-05 23:25:47  amaury.forgeotdarc  set  files: + namedtuple-pickle.diff; nosy: + amaury.forgeotdarc; messages: + msg169892
2012-09-01 00:29:19  rhettinger  set  messages: + msg169597; title: Fix pickling of named tuples in 2.7.3 (BUG) -> Fix pickling efficiency of named tuples in 2.7.3
2012-09-01 00:06:54  rhettinger  set  assignee: rhettinger
2012-08-31 19:57:56  thomie  set  files: - namedtuple_pickle_fix.patch
2012-08-31 19:56:45  thomie  set  files: + namedtuple_pickle_fix.patch; messages: + msg169578; title: Fix pickling of named tuples in 2.7.3 -> Fix pickling of named tuples in 2.7.3 (BUG)
2012-08-02 13:44:45  thomie  create