Message 113886 - Python tracker


Author	pitrou
Recipients	daniel.urban, dmtr, eric.araujo, ezio.melotti, mark.dickinson, pitrou, rhettinger, terry.reedy
Date	2010-08-14.11:16:48
Message-id	<1281784612.24.0.732941398549.issue9520@psf.upfronthosting.co.za>

Content
> For example, on the x64 machine the following dict() mapping 
> 10,000,000 very short unicode keys (~7 chars) to integers eats 149
> bytes per entry. 

This is counting the keys too. Under 3.x:

>>> import sys
>>> d = {}
>>> for i in range(0, 10000000): d[str(i)] = i
... 
>>> sys.getsizeof(d)
402653464
>>> sys.getsizeof(d) / len(d)
40.2653464

So the dict itself uses ~40 bytes per entry. Since this is a 64-bit Python build, each hash table entry takes three 8-byte words, that is 24 bytes (one word for the key pointer, one for the value pointer, one for the cached hash value). So the ratio of allocated entries in the hash table to used entries is about 1.7 (40 / 24), which is reasonable.
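The arithmetic can be checked directly from the figures quoted above. This is only a sketch: the variable names are mine, the 24-byte entry size is the one described for this 64-bit 3.x build, and the slot count is a slight overestimate because the dict's small fixed header is not subtracted.

```python
# Figures quoted in the session above (64-bit build, Python 3.x of the time).
WORD = 8                   # bytes per machine word on a 64-bit build
ENTRY_SIZE = 3 * WORD      # key pointer + value pointer + cached hash
dict_size = 402653464      # sys.getsizeof(d)
used = 10_000_000          # len(d)

# Approximate number of slots in the hash table (fixed header not subtracted).
allocated_slots = dict_size // ENTRY_SIZE

# Bytes of dict storage per stored entry, and allocated slots per used entry.
bytes_per_used = dict_size / used
ratio = allocated_slots / used

print(f"{bytes_per_used:.1f} bytes per used entry")
print(f"{allocated_slots} slots allocated, ratio {ratio:.2f}")
```

Other Python versions lay out dicts differently, so the entry size and the resulting ratio should not be assumed to carry over.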

Do note that unicode objects themselves are not that compact:

>>> sys.getsizeof("1000000")
72

If you have many of them, you might use bytestrings instead:

>>> sys.getsizeof(b"1000000")
40

I've modified your benchmark to run under 3.x and will post it in a later message (I don't know whether bio.trie exists for 3.x, though).
