Message 112937 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	dmtr
Recipients	dmtr
Date	2010-08-05.00:59:19
SpamBayes Score	5.334807e-07
Marked as misclassified	No
Message-id	<1280969963.37.0.950740145402.issue9520@psf.upfronthosting.co.za>
In-reply-to

Content
On large data sets (10-100 million keys) the default python dictionary implementation fails to meet memory and performance constraints. It also apparently fails to keep O(1) complexity (after just 1M keys). As such, there is a need for good, optimized, practical implementation of a high-performance container that can store such datasets. One of the alternatives is a regular Patricia Trie data structure. It can meet performance requirements on such datasets. Criteria: * strictly O(1); * works well on large datasets (~100M keys); memory efficient; * same or better performance as dict(); * supports regular dict \| counter interface; * supports unicode keys; * supports any characters in the keys; * persistent (can be pickled); * in-memory library; There are a few existing implementations available: * BioPython trie - http://biopython.org/ * py-radix - http://www.mindrot.org/projects/py-radix/ * PyPi trie - http://pypi.python.org/pypi/trie A few other relevant alternatives/implementations: * NIST trie - http://www.itl.nist.gov/div897/sqg/dads/HTML/patriciatree.html * C++ Trie Library - http://wikipedia-clustering.speedblue.org/trie.php * PyAvl trie - http://pypi.python.org/pypi/pyavl/1.12_1 * PyJudy tree - http://www.dalkescientific.com/Python/PyJudy.html * Redis - http://code.google.com/p/redis/ * Memcached - http://memcached.org/ An alternative to a basic Patricia Trie could be some state-of-the-art approach from some modern research (i.e. http://www.aclweb.org/anthology/W/W09/W09-1505.pdf ), The best existing implementation I've been able to find so far is one in the BioPython. Compared to defaultdict(int) on the task of counting words. Dataset 123,981,712 words (6,504,484 unique), 1..21 characters long: * bio.tree - 459 Mb/0.13 Hours, good O(1) behavior * defaultdict(int) - 693 Mb/0.32 Hours, poor, almost O(N) behavior At 8,000,0000 keys python defaultdict(int) starts showing almost O(N) behavior and gets unusable with 10,000,000+ unique keys. A few usage/installatio notes on BioPython trie: $ sudo apt-get install python-biopython >>> from Bio import trie >>> trieobj = trie.trie() >>> trieobj["hello"] = 5 >>> trieobj["hello"] += 1 >>> print trieobj["hello"] >>> print trieobj.keys() More examples at: http://python-biopython.sourcearchive.com/documentation/1.54/test__triefind_8py-source.html

On large data sets (10-100 million keys) the default python dictionary implementation fails to meet memory and performance constraints. It also apparently fails to keep O(1) complexity (after just 1M keys). As such, there is a need for good, optimized, practical implementation of a high-performance container that can store such datasets. 

One of the alternatives is a regular Patricia Trie data structure. It can meet performance requirements on such datasets. Criteria:
* strictly O(1);
* works well on large datasets (~100M keys); memory efficient;
* same or better performance as dict();
* supports regular dict | counter interface;
* supports unicode keys;
* supports any characters in the keys;
* persistent (can be pickled); 
* in-memory library;

There are a few existing implementations available:
* BioPython trie - http://biopython.org/
* py-radix - http://www.mindrot.org/projects/py-radix/
* PyPi trie - http://pypi.python.org/pypi/trie

A few other relevant alternatives/implementations:
* NIST trie - http://www.itl.nist.gov/div897/sqg/dads/HTML/patriciatree.html
* C++ Trie Library - http://wikipedia-clustering.speedblue.org/trie.php
* PyAvl trie - http://pypi.python.org/pypi/pyavl/1.12_1
* PyJudy tree - http://www.dalkescientific.com/Python/PyJudy.html
* Redis - http://code.google.com/p/redis/
* Memcached - http://memcached.org/

An alternative to a basic Patricia Trie could be some state-of-the-art approach from some modern research (i.e. http://www.aclweb.org/anthology/W/W09/W09-1505.pdf ), 


The best existing implementation I've been able to find so far is one in the BioPython. Compared to defaultdict(int) on the task of counting words. Dataset 123,981,712 words (6,504,484 unique), 1..21 characters long:
* bio.tree - 459 Mb/0.13 Hours, good O(1) behavior
* defaultdict(int) - 693 Mb/0.32 Hours, poor, almost O(N) behavior

At 8,000,0000 keys python defaultdict(int) starts showing almost O(N) behavior and gets unusable with 10,000,000+ unique keys. 



A few usage/installatio notes on BioPython trie:
$ sudo apt-get install python-biopython

>>> from Bio import trie
>>> trieobj = trie.trie()
>>> trieobj["hello"] = 5
>>> trieobj["hello"] += 1
>>> print trieobj["hello"]
>>> print trieobj.keys()

More examples at:
http://python-biopython.sourcearchive.com/documentation/1.54/test__triefind_8py-source.html

History
Date	User	Action	Args
2010-08-05 00:59:23	dmtr	set	recipients: + dmtr
2010-08-05 00:59:23	dmtr	set	messageid: <1280969963.37.0.950740145402.issue9520@psf.upfronthosting.co.za>
2010-08-05 00:59:21	dmtr	link	issue9520 messages
2010-08-05 00:59:19	dmtr	create