classification
Title: Dict untracking can result in quadratic dict build-up
Type: performance Stage: resolved
Components: Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: pitrou, python-dev, stw
Priority: normal Keywords: patch

Created on 2012-05-10 20:03 by stw, last changed 2012-05-28 20:33 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
make_file.py stw, 2012-05-10 20:03 Create file test.pickle
load_file.py stw, 2012-05-10 20:04
dict_untrack.patch pitrou, 2012-05-22 05:26 review
untracking_docs.patch stw, 2012-05-25 23:49 review
Messages (19)
msg160368 - (view) Author: Tim Silk (stw) * Date: 2012-05-10 20:03
I've found that unpickling a certain kind of dictionary is substantially slower in python 2.7 compared to python 2.6. The dictionary has keys that are tuples of strings - a 1-tuple is enough to see the effect. The problem seems to be caused by garbage collection, as turning it off eliminates the slowdown. Both pickle and cPickle modules are affected.


I've attached two files to demonstrate this. The file 'make_file.py'
creates a dictionary of specified size, with keys containing 1-tuples of random strings. It then dumps the dictionary to a pickle file using a specified pickle module.

The file 'load_file.py' unpickles the file created by 'make_file.py', using a specified pickle module, and prints the time taken. The code can be run with garbage collection either on or off.

The results below are for a dictionary of 200000 entries. Each entry is the time taken in seconds with garbage collection on / garbage collection off. The row headings are the module used to pickle the data, the column headings the module used to unpickle it.


python 2.6, n = 200000

               size    pickle      cPickle
    pickle     4.3M    3.02/2.65   0.786/0.559 
    cPickle    3.4M    2.27/2.04   0.66/0.443 


python 2.7, n = 200000

               size    pickle      cPickle
    pickle     4.3M    10.5/2.67   6.62/0.563 
    cPickle    2.4M    1.45/1.39   0.362/0.325 


When pickle is used to pickle the data, there is a significant slowdown in python 2.7 compared to python 2.6 with garbage collection on. With garbage collection off the times in python 2.7 are essentially identical to those in python 2.6.

When cPickle is used to pickle the data, both unpicklers are faster in python 2.7 than in python 2.6. Presumably the speedup is due to the dictionary optimizations introduced from issue #5670.


Both pickle and cPickle show a slowdown when data pickled in python 2.6 is unpickled in python 2.7:


pickled in python 2.6, unpickled in python 2.7, n = 200000

                      size    pickle (2.7)    cPickle (2.7)
    pickle (2.6)      4.3M    10.4/2.66       6.64/0.56 
    cPickle (2.6)     3.4M    8.73/2.08       6.1/0.452 


I don't know enough about the internals of the pickle modules or garbage collector to offer an explanation/fix. The list of optimizations for python 2.7 indicates changes to both pickle modules (issues #5670 and #5084) and the garbage collector (issues #4074 and #4688). It seems possible that the slowdown is the result of some interaction between these changes.


Further notes:

1. System details: python 2.6.5 and python 2.7.3 on Ubuntu 10.04, 1.73GHz Pentium M processor.

2. Only pickle files created with protocols 1 and 2 are affected. Pickling with protocol 0 gives similar timings on python 2.6 and 2.7.

3. The fact that the dictionary's keys are tuples is relevant, although the length of the tuple is not. Unpickling a dictionary whose keys are strings does not show any slowdown.
msg160369 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-10 20:15
> When pickle is used to pickle the data, there is a significant slowdown

> Both pickle and cPickle show a slowdown when data pickled in python 2.6 is unpickled in python 2.7:

This sounds rather weird. Presumably the structure of the pickle streams shouldn't have changed from 2.6 to 2.7.
You can use the pickletools.dis() function to disassemble a pickle stream, and examine whether there are any differences between the various files.
msg161293 - (view) Author: Tim Silk (stw) * Date: 2012-05-21 20:30
Thanks for the nod to the pickletools module. The structure of the streams do change between python 2.6 and 2.7, at least for files created with cPickle. You can see this from the file sizes that I quoted in my previous post. Below are some snippets of the disassemblies of pickle files generated using make_file.py:

pickle, python 2.6 and 2.7:

    0: \x80 PROTO      2
    2: }    EMPTY_DICT
    3: q    BINPUT     0
    5: (    MARK
    6: U        SHORT_BINSTRING 'ArLLX'
   13: q        BINPUT     1
   15: \x85     TUPLE1
   16: q        BINPUT     2
   18: K        BININT1    0
   ...

cPickle, python 2.6:

    0: \x80 PROTO      2
    2: }    EMPTY_DICT
    3: q    BINPUT     1
    5: (    MARK
    6: U        SHORT_BINSTRING 'XYoGB'
   13: \x85     TUPLE1
   14: q        BINPUT     2
   16: K        BININT1    0
   ...

cPickle, python 2.7:

    0: \x80 PROTO      2
    2: }    EMPTY_DICT
    3: q    BINPUT     1
    5: (    MARK
    6: U        SHORT_BINSTRING 'QGgRa'
   13: \x85     TUPLE1
   14: K        BININT1    1
   ...

As you can see, pickle memoizes both the string and the tuple. cPickle2.6 just memoizes the tuple, while cPickle2.7 doesn't memoize either. It therefore seems that the problem is somehow related to the memo...
msg161295 - (view) Author: Tim Silk (stw) * Date: 2012-05-21 20:36
One other thing - python 2.7 added the is_tracked function to the gc module. As I understand it (from issue #4074) tuples of atomic types are supposed to be untracked. However, in python 2.7:

>>> gc.is_tracked((1, 2))
True
>>> gc.is_tracked("hello")
True
msg161302 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-21 22:20
> As I understand it (from issue #4074) tuples of atomic types are supposed to be untracked.

Actually, it is done lazily:

>>> t = 1,2
>>> gc.is_tracked(t)
True
>>> gc.collect()
0
>>> gc.is_tracked(t)
False
msg161324 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-22 05:26
Okay, found it. The problem is _PyDict_MaybeUntrack() is potentially O(n), so calling it in every GC collection can produce quadratic runtimes when a dict is building up. Attached patch (for 3.3) only calls it in the older collections. Not sure this should be backported to 2.7/3.2.
msg161340 - (view) Author: Tim Silk (stw) * Date: 2012-05-22 08:57
I'd come to the same conclusion - as the new dict is built up (using batch build) it keeps appearing in generation 0, and the gc has to walk over it to determine that it should be untracked. 

However, this only seems to be true if the pickle file is created using pickle - it doesn't happen with files generated with cPickle. With cPickle the dict remains tracked, and passes from generation 0 to 1 to 2. The only difference is that pickle memoizes the tuples, while cPickle doesn't. Why does the memoization make a difference?
msg161347 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-22 12:35
> However, this only seems to be true if the pickle file is created
> using pickle - it doesn't happen with files generated with cPickle.
> With cPickle the dict remains tracked, and passes from generation 0 to
> 1 to 2. The only difference is that pickle memoizes the tuples, while
> cPickle doesn't. Why does the memoization make a difference?

Probably because memoization itself uses a dict.
msg161348 - (view) Author: Tim Silk (stw) * Date: 2012-05-22 13:06
> Probably because memoization itself uses a dict.

Right, but as far as I can tell it's not the memo dict that keeps being tracked/untracked. Rather, it's the dict that is being unpickled.

Anyway, I suppose the point is that the issue of whether an object is tracked/untracked is not solely determined by its type:

1. All containers are tracked by default.

2. Tuples can only become untracked after a generation 0 gc pass.

3. With the new patch, dicts can only become untracked after a generation 2 gc pass.

4. Also, am I right in thinking that whether a container gets untracked or not depends not only on its contents, but also on the order of the objects in the gc list? That is, all of the contents of a container must get untracked before the container is considered.
msg161363 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-22 16:20
> Also, am I right in thinking that whether a container gets untracked or 
> not depends not only on its contents, but also on the order of the
> objects in the gc list? That is, all of the contents of a container
> must get untracked before the container is considered.

Indeed, and it could explain the difference in behaviour between the memo dict and the other dict.
msg161421 - (view) Author: Tim Silk (stw) * Date: 2012-05-23 15:30
I had a thought about untracking tuples. If a tuple contains only immutable objects (atomics and tuples of atomics etc), then it should be untracked. Once untracked, it will never need to be tracked again since the tuple is immutable. If a tuple contains mutable objects, it will always need to be tracked.

I was wondering whether it is possible to determine whether a tuple needs to be tracked or not the first time it appears in generation 0 - tuples in older generations would then not need to be considered.
msg161425 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-23 15:46
> I had a thought about untracking tuples. If a tuple contains only
> immutable objects (atomics and tuples of atomics etc), then it should
> be untracked. Once untracked, it will never need to be tracked again
> since the tuple is immutable. If a tuple contains mutable objects, it
> will always need to be tracked.

True. However, some tuples may be in an unfinished state (they are being
built up internally and a GC collection occurred in the middle).

> I was wondering whether it is possible to determine whether a tuple
> needs to be tracked or not the first time it appears in generation 0 -
> tuples in older generations would then not need to be considered.

I'm not sure that would make much of a difference in practice. Most
tuples are very short, so checking them for untracking should be
reasonably fast.
msg161434 - (view) Author: Tim Silk (stw) * Date: 2012-05-23 16:22
> I had a thought about untracking tuples. If a tuple contains only
> immutable objects (atomics and tuples of atomics etc), then it should
> be untracked. Once untracked, it will never need to be tracked again
> since the tuple is immutable. If a tuple contains mutable objects, it
> will always need to be tracked.

> True. However, some tuples may be in an unfinished state (they are 
> being built up internally and a GC collection occurred in the middle).

So the tuple is linked-in to the garbage collection list before its contents are constructed?

> I was wondering whether it is possible to determine whether a tuple
> needs to be tracked or not the first time it appears in generation 0 -
> tuples in older generations would then not need to be considered.

> I'm not sure that would make much of a difference in practice. Most
> tuples are very short, so checking them for untracking should be
> reasonably fast.

Could tuples not be untracked at creation time then, or do not enough survive to gc to make this worthwhile?
msg161436 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-23 16:26
Le mercredi 23 mai 2012 à 16:22 +0000, stw a écrit :
> So the tuple is linked-in to the garbage collection list before its
> contents are constructed?

It is. It typically happens when you do (in C code):

PyObject *my_tuple = PyTuple_New(2);
/* compute some_object ... */
PyTuple_SET_ITEM(my_tuple, 0, some_object);
/* compute some_other_object ... */
PyTuple_SET_ITEM(my_tuple, 1, some_other_object);

> Could tuples not be untracked at creation time then, or do not enough
> survive to gc to make this worthwhile?

The latter. Also, tuples are created at a high rate in the interpreter,
so it can actually decrease performance.
msg161630 - (view) Author: Tim Silk (stw) * Date: 2012-05-25 23:49
> So the tuple is linked-in to the garbage collection list before its
> contents are constructed?

> It is. It typically happens when you do (in C code):

Ok, thanks. I couldn't see how a tuple could be created before its contents in python code, but it is clearly possible in C.


I have written up some documentation on how untracking is handled in the gc - please see the attached patch. It summarises our discussion and some of the posts in issue #4688.
msg161678 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-26 19:15
> I have written up some documentation on how untracking is handled in
> the gc - please see the attached patch.

Thank you for doing this! Do you have a real name so that I can credit you?
Also, could you fill a contributor agreement? http://www.python.org/psf/contrib/
msg161802 - (view) Author: Tim Silk (stw) * Date: 2012-05-28 20:12
> Thank you for doing this! 

No problem. I've been thinking about getting involved with python development for a while - this seemed a good place to start!

> Do you have a real name so that I can credit you?

Yes (thankfully) - I've added it to my profile.

> Also, could you fill a contributor agreement? 
> http://www.python.org /psf/contrib/

Yes, will do.
msg161804 - (view) Author: Roundup Robot (python-dev) Date: 2012-05-28 20:31
New changeset 47e6217d0e84 by Antoine Pitrou in branch '3.2':
Issue #14775: Fix a potential quadratic dict build-up due to the garbage collector repeatedly trying to untrack dicts.
http://hg.python.org/cpython/rev/47e6217d0e84

New changeset 7951900afd00 by Antoine Pitrou in branch 'default':
Issue #14775: Fix a potential quadratic dict build-up due to the garbage collector repeatedly trying to untrack dicts.
http://hg.python.org/cpython/rev/7951900afd00

New changeset 6554ebbeb2f3 by Antoine Pitrou in branch '2.7':
Issue #14775: Fix a potential quadratic dict build-up due to the garbage collector repeatedly trying to untrack dicts.
http://hg.python.org/cpython/rev/6554ebbeb2f3
msg161805 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-28 20:33
> No problem. I've been thinking about getting involved with python
> development for a while - this seemed a good place to start!

It is. You can also subscribe to the core-mentorship mailing-list if you want: http://mail.python.org/mailman/listinfo/core-mentorship

I've now committed the patches. Thanks for your contribution!

(if by chance it doesn't solve the issue, feel free to re-open it)
History
Date User Action Args
2012-05-28 20:33:09pitrousetstatus: open -> closed
resolution: fixed
messages: + msg161805

stage: patch review -> resolved
2012-05-28 20:31:24python-devsetnosy: + python-dev
messages: + msg161804
2012-05-28 20:12:00stwsetmessages: + msg161802
2012-05-26 19:15:08pitrousetmessages: + msg161678
2012-05-25 23:49:50stwsetfiles: + untracking_docs.patch

messages: + msg161630
2012-05-23 16:26:04pitrousetmessages: + msg161436
2012-05-23 16:22:39stwsetmessages: + msg161434
2012-05-23 15:46:19pitrousetmessages: + msg161425
2012-05-23 15:30:29stwsetmessages: + msg161421
2012-05-22 16:20:38pitrousetmessages: + msg161363
2012-05-22 13:06:26stwsetmessages: + msg161348
2012-05-22 12:35:09pitrousetmessages: + msg161347
2012-05-22 08:57:03stwsetmessages: + msg161340
2012-05-22 05:26:18pitrousetfiles: + dict_untrack.patch
versions: + Python 3.2, Python 3.3
title: Slow unpickling of certain dictionaries in python 2.7 vs python 2.6 -> Dict untracking can result in quadratic dict build-up
messages: + msg161324

keywords: + patch
stage: patch review
2012-05-21 22:20:07pitrousetmessages: + msg161302
2012-05-21 20:36:24stwsetmessages: + msg161295
2012-05-21 20:30:31stwsetmessages: + msg161293
2012-05-10 20:15:20pitrousetnosy: + pitrou
messages: + msg160369
2012-05-10 20:04:48stwsetfiles: + load_file.py
2012-05-10 20:03:53stwcreate