Title: Add Patricia Trie high performance container
Type: enhancement Stage:
Components: Library (Lib) Versions: Python 3.2
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: daniel.urban, djc, dmtr, eric.araujo, ezio.melotti, lukasz.langa, mark.dickinson, pitrou, rhettinger, terry.reedy
Priority: normal Keywords:

Created on 2010-08-05 00:59 by dmtr, last changed 2010-11-14 00:50 by rhettinger. This issue is now closed.

File name Uploaded Description Edit dmtr, 2010-08-06 01:40 Dict benchmark. Synthetic test case. dmtr, 2010-08-13 21:50 dict/trie benchmark & test code. pitrou, 2010-08-14 11:25
Messages (13)
msg112937 - (view) Author: Dmitry Chichkov (dmtr) Date: 2010-08-05 00:59
On large data sets (10-100 million keys) the default python dictionary implementation fails to meet memory and performance constraints. It also apparently fails to keep O(1) complexity (after just 1M keys). As such, there is a need for good, optimized, practical implementation of a high-performance container that can store such datasets. 

One of the alternatives is a regular Patricia Trie data structure. It can meet performance requirements on such datasets. Criteria:
* strictly O(1);
* works well on large datasets (~100M keys); memory efficient;
* same or better performance as dict();
* supports regular dict | counter interface;
* supports unicode keys;
* supports any characters in the keys;
* persistent (can be pickled); 
* in-memory library;

There are a few existing implementations available:
* BioPython trie -
* py-radix -
* PyPi trie -

A few other relevant alternatives/implementations:
* NIST trie -
* C++ Trie Library -
* PyAvl trie -
* PyJudy tree -
* Redis -
* Memcached -

An alternative to a basic Patricia Trie could be some state-of-the-art approach from some modern research (i.e. ), 

The best existing implementation I've been able to find so far is one in the BioPython. Compared to defaultdict(int) on the task of counting words. Dataset 123,981,712 words (6,504,484 unique), 1..21 characters long:
* bio.tree - 459 Mb/0.13 Hours, good O(1) behavior
* defaultdict(int) - 693 Mb/0.32 Hours, poor, almost O(N) behavior

At 8,000,0000 keys python defaultdict(int) starts showing almost O(N) behavior and gets unusable with 10,000,000+ unique keys. 

A few usage/installatio notes on BioPython trie:
$ sudo apt-get install python-biopython

>>> from Bio import trie
>>> trieobj = trie.trie()
>>> trieobj["hello"] = 5
>>> trieobj["hello"] += 1
>>> print trieobj["hello"]
>>> print trieobj.keys()

More examples at:
msg112945 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-08-05 03:01
I think it would be better to propose this on the python-ideas mailing list.
msg112965 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-08-05 09:51
Thanks for the request Dmitry. You may want to read to know which steps are required to add a module to the stdlib. In particular, the module has to be proven on the field and the author needs to sign a contributor agreement. It’s not clear which implementation you want to add (the one from BioPython I guess); I don’t know if this proposal will fly if you’re not the author of the module. (I don’t want to scare you away, just to explain the process.)

If you contact the author of the code you want to include and they agree with your idea, then it’ll be time to convince python-ideas that this would be a generally useful addition and that the chosen class is the best of breed.

A note about versions: I removed 3.1 since new features don’t go in stable releases, and 3.3 since it does not exist yet. Everything in 3.2 will go in 3.3 too: this value is useful only to mean “won’t go in 3.2, remind later”.

I also edited the title to make it shorter, to make emails and reports more readable. Thanks again for your report.
msg112967 - (view) Author: Dmitry Chichkov (dmtr) Date: 2010-08-05 10:11
Thank you for your comment. Perhaps we should try separate this into two issues:
1) Bug. Python's dict() is unusable on datasets with 10,000,000+ keys. Here I should provide a solid test case showing a deviation from O(1);

2) Feature request/idea. Add Patricia Trie high performance container.
msg112970 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-08-05 11:12
> 1) Bug. Python's dict() is unusable on datasets with 10,000,000+ keys.
> Here I should provide a solid test case showing a deviation from O(1);

That would be helpful.  Are you sure that the slow-down you're seeing isn't simply due to running out of system memory?
msg112973 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-08-05 12:06
> 1) Bug. Python's dict() is unusable on datasets with 10,000,000+ keys. 
> Here I should provide a solid test case showing a deviation from O(1);

Try to disable the cyclic garbage collector before the test (gc.disable()).
msg113060 - (view) Author: Dmitry Chichkov (dmtr) Date: 2010-08-06 01:40
No. I'm not simply running out of system memory. 8Gb/x64/linux. And in my test cases I've only seen ~25% of memory utilized. And good idea. I'll try to play with the cyclic garbage collector.

It is harder than I thought to make a solid synthetic test case addressing that issue. The trouble you need to be able to generate data (e.g. 100,000,000 words/5,000,000 unique) with a distribution close to that in the real life scenario (e.g. word lengths, frequencies and uniqueness in the english text). If somebody have a good idea onto how to do it nicely - you'd be very welcome. 

My best shot so far is in the attachment.
msg113313 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-08-08 20:27
A decade ago, 8gig memory was rare, now it is almost standard for high-end consumer machines. People are now doing bigger computations and stressing the implementation more. So it is plausible that we need to either tweak the core class implementations or (non-exclusively) add specialize structures optimized for certain tasks.

However, please to not throw the word 'bug' around lightly. When inappropriate, it only gets in the way. The language standard intentionally omits performance guarantees, especially when running on fragmented finite memory with multiple cache levels and possible OS limitations.
msg113337 - (view) Author: Dmitry Chichkov (dmtr) Date: 2010-08-08 22:02
Yes. Data containers optimized for very large datasets, compactness and strict adherence to O(1) can be beneficial. 

Python have great high performance containers, but there is a certain lack of compact ones. For example, on the x64 machine the following dict() mapping 10,000,000 very short unicode keys (~7 chars) to integers eats 149 bytes per entry. 
>>> import os, re
>>> d = dict()
>>> for i in xrange(0, 10000000): d[unicode(i)] = i
>>> print re.findall("(VmPeak.*|VmSize.*)", open('/proc/%d/status' % os.getpid()).read())
['VmPeak:\t 1458324 kB', 'VmSize:\t 1458324 kB']

I can understand that there are all kinds of reasons why it is so and even why it is good. But having an unobtrusive *compact* container could be nice (although you'd be most welcome if you could tweak default containers, so they would adjust to the large datasets appropriately),

Also I can't emphasize more that compactness is still important sometimes. Modern days datasets are getting larger and larger (literally terabytes) and 'just add more memory' strategy is not always feasible. 

Regarding the dict() violation of O(1). So far I was unable to reproduce it in the test. I can certainly see it on the real dataset, and trust me it was very annoying to see ETA 10 hours going down to 8 hours and then gradually up again to 17 hours and hanging there. This was _solved_ by switching from dict() to Bio.trie(). So this problem certainly had something to do with dict(). I don't know what is causing it though.
msg113886 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-08-14 11:16
> For example, on the x64 machine the following dict() mapping 
> 10,000,000 very short unicode keys (~7 chars) to integers eats 149
> bytes per entry. 

This is counting the keys too. Under 3.x:

>>> d = {}
>>> for i in range(0, 10000000): d[str(i)] = i
>>> sys.getsizeof(d)
>>> sys.getsizeof(d) / len(d)

So, the dict itself uses ~40 bytes/entry. Since this is a 64-bit Python build, each entry uses three words of 8 bytes each, that is 24 bytes per entry (one word for the key, one word for the associated value, one word for the cached hash value). So, you see the ratio of allocated entries in the hash table over used entries is only a bit above 2, which is reasonable.

Do note that unicode objects themselves are not that compact:

>>> sys.getsizeof("1000000")

If you have many of them, you might use bytestrings instead:

>>> sys.getsizeof(b"1000000")

I've modified your benchmark to run under 3.x and will post it in a later message (I don't know whether bio.trie exists for 3.x, though).
msg113887 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-08-14 11:25
So, here is the modified benchmark. It first creates a cache of the random wordlist, because it is quite long to compute (for N up to 10000000). The cache is reused in subsequent runs. This takes some memory though (probably won't run it if you have less than 2GB). The cache itself is quite small on-disk (around 80 MB).

I'm outputting the total dict size and the space occupied per individual key, and I've also timed lookups separately. Here are results on my machine:

    1000 words (     961 keys), 3269137 words/s, 16980987 lookups/s, 51 bytes/key (0.0MB)
   10000 words (    9042 keys), 2697648 words/s, 13680052 lookups/s, 87 bytes/key (0.8MB)
  100000 words (   83168 keys), 2462269 words/s, 6956074 lookups/s, 37 bytes/key (3.0MB)
  500000 words (  389442 keys), 1802431 words/s, 4733774 lookups/s, 64 bytes/key (24.0MB)
 1000000 words (  755372 keys), 1702130 words/s, 4485229 lookups/s, 66 bytes/key (48.0MB)
 2000000 words ( 1463359 keys), 1616658 words/s, 4251021 lookups/s, 68 bytes/key (96.0MB)
 5000000 words ( 3501140 keys), 1608889 words/s, 3909212 lookups/s, 57 bytes/key (192.0MB)
10000000 words ( 6764089 keys), 1531136 words/s, 3600395 lookups/s, 59 bytes/key (384.0MB)

As you can see, the O(1) behaviour seems almost observed up to the 10000000 words limit (given the way the data is computed, I'm afraid I can't go further than that). Of course, very small dicts are faster because everything fits in the CPU caches.
msg113937 - (view) Author: Dmitry Chichkov (dmtr) Date: 2010-08-15 00:31
Yes, it looks like you are right. And while there is some slight performance degradation, at least nothing drastic is happening up to 30M keys. Using your modified test:

    1000 words (     961 keys), 3609555 words/s, 19239926 lookups/s, 51 bytes/key (0.0MB)
   10000 words (    9042 keys), 3390980 words/s, 18180771 lookups/s, 87 bytes/key (0.0MB)
  100000 words (   83168 keys), 3298809 words/s, 12509481 lookups/s, 37 bytes/key (3.0MB)
 1000000 words (  755372 keys), 2477793 words/s, 7537963 lookups/s, 66 bytes/key (48.0MB)
 5000000 words ( 3501140 keys), 2291004 words/s, 6487727 lookups/s, 57 bytes/key (192.0MB)
10000000 words ( 6764089 keys), 2238081 words/s, 6216454 lookups/s, 59 bytes/key (384.0MB)
20000000 words (13061821 keys), 2175688 words/s, 5817085 lookups/s, 61 bytes/key (768.0MB)
50000000 words (31188460 keys), 2117751 words/s, 5137209 lookups/s, 51 bytes/key (1536.0MB)

It looks like internal memory estimates (sys.getsizeof()) are also pretty good:

Before: ['VmPeak:\t10199764 kB', 'VmSize:\t 5552416 kB']
50000000 words (31188460 keys), 2117751 words/s, 5137209 lookups/s, 51 bytes/key (1536.0MB)
After: ['VmPeak:\t10199764 kB', 'VmSize:\t 7125284 kB']
7 125 284 kB - 5 552 416 kB = 1 536.00391 MB
msg121169 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-11-14 00:50
I don't think a Patricia Trie is going to find its way into the code distribution.  It has a better chance as a third-party module listed on PyPI (much in the same way that people access memcached from Python).
Date User Action Args
2010-11-14 00:50:41rhettingersetstatus: open -> closed
resolution: rejected
messages: + msg121169
2010-11-12 00:53:24lukasz.langasetnosy: + lukasz.langa
2010-11-11 17:45:20rhettingersetassignee: rhettinger
2010-11-11 08:02:39djcsetnosy: + djc
2010-08-15 00:31:21dmtrsetmessages: + msg113937
2010-08-14 11:25:49pitrousetfiles: +

messages: + msg113887
2010-08-14 11:16:50pitrousetmessages: + msg113886
2010-08-13 21:50:33dmtrsetfiles: +
2010-08-08 22:02:24dmtrsetmessages: + msg113337
2010-08-08 20:27:01terry.reedysetnosy: + terry.reedy
messages: + msg113313
2010-08-06 20:35:00daniel.urbansetnosy: + daniel.urban
2010-08-06 01:40:21dmtrsetfiles: +

messages: + msg113060
2010-08-05 12:06:17pitrousetnosy: + pitrou
messages: + msg112973
2010-08-05 11:12:11mark.dickinsonsetnosy: + mark.dickinson
messages: + msg112970
2010-08-05 10:11:43dmtrsetmessages: + msg112967
2010-08-05 09:51:36eric.araujosettype: performance -> enhancement
2010-08-05 09:51:10eric.araujosettitle: Add Patricia Trie high performance container (python's defaultdict(int) is unusable on datasets with 10,000,000+ keys.) -> Add Patricia Trie high performance container
nosy: + eric.araujo

messages: + msg112965

type: enhancement -> performance
2010-08-05 03:01:19ezio.melottisetversions: - Python 3.1, Python 3.3
nosy: + rhettinger, ezio.melotti

messages: + msg112945

type: performance -> enhancement
2010-08-05 00:59:21dmtrcreate