
Author pitrou
Recipients daniel.urban, dmtr, eric.araujo, ezio.melotti, mark.dickinson, pitrou, rhettinger, terry.reedy
Date 2010-08-14.11:16:48
Message-id <1281784612.24.0.732941398549.issue9520@psf.upfronthosting.co.za>
Content
> For example, on the x64 machine the following dict() mapping 
> 10,000,000 very short unicode keys (~7 chars) to integers eats 149
> bytes per entry. 

This is counting the keys too. Under 3.x:

>>> import sys
>>> d = {}
>>> for i in range(0, 10000000): d[str(i)] = i
... 
>>> sys.getsizeof(d)
402653464
>>> sys.getsizeof(d) / len(d)
40.2653464

So the dict itself uses ~40 bytes per entry. Since this is a 64-bit Python build, each slot in the hash table takes three words of 8 bytes each, that is 24 bytes per slot (one word for the key, one word for the associated value, one word for the cached hash value). So the ratio of allocated slots in the hash table to used entries is about 40 / 24 ≈ 1.7, which is reasonable.
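
The ratio can be checked directly from the figures quoted above. The following is a minimal sketch, not part of the original measurement: it assumes the 24-byte slot layout described in this message and that the table size is a power of two, and it treats whatever bytes are left over as fixed per-dict overhead. The constant names are made up for illustration, and the numbers apply only to the build measured here; other CPython versions lay out dicts differently.

```python
# Figures taken from the interpreter session in this message.
DICT_BYTES = 402653464      # sys.getsizeof(d)
USED_ENTRIES = 10_000_000   # len(d)

# Assumption from the message: 64-bit build, three 8-byte words per slot
# (key pointer, value pointer, cached hash).
WORD_BYTES = 8
SLOT_BYTES = 3 * WORD_BYTES

# Assumption: the table size is a power of two, so round the raw
# quotient down to the nearest power of two.
raw_slots = DICT_BYTES // SLOT_BYTES
slots = 1 << (raw_slots.bit_length() - 1)

# Bytes not accounted for by the slots (assumed fixed dict overhead).
overhead = DICT_BYTES - slots * SLOT_BYTES

# Allocated slots per used entry.
ratio = slots / USED_ENTRIES

print(f"slots={slots} overhead={overhead} ratio={ratio:.2f}")
```

Under these assumptions the table holds 2**24 slots, which gives roughly 1.68 allocated slots per used entry.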

Do note that unicode objects themselves are not that compact:

>>> sys.getsizeof("1000000")
72

If you have many of them, you might use bytestrings instead:

>>> sys.getsizeof(b"1000000")
40
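
To see how much the keys themselves contribute, one can sum `sys.getsizeof` over them. This is only a rough sketch: the helper name `keys_footprint` is hypothetical, it uses a much smaller key set than the 10,000,000 entries in the benchmark, and the exact byte counts depend on the Python version and build, so no expected output is shown.

```python
import sys

def keys_footprint(keys):
    """Return the total size in bytes of the key objects themselves.

    Hypothetical helper for illustration only; it does not count the
    dict's hash table or the values.
    """
    return sum(sys.getsizeof(k) for k in keys)

# A small sample of short keys, as unicode strings and as bytestrings.
N = 1000
str_keys = [str(i) for i in range(N)]
bytes_keys = [str(i).encode("ascii") for i in range(N)]

str_total = keys_footprint(str_keys)
bytes_total = keys_footprint(bytes_keys)

print("str keys:  ", str_total / N, "bytes per key")
print("bytes keys:", bytes_total / N, "bytes per key")
```

Adding the per-key figure to the per-entry figure for the dict itself gives a closer estimate of the total cost per entry than `sys.getsizeof(d)` alone.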

I've modified your benchmark to run under 3.x and will post it in a later message (I don't know whether bio.trie exists for 3.x, though).