Author dmtr
Recipients daniel.urban, dmtr, eric.araujo, ezio.melotti, mark.dickinson, pitrou, rhettinger, terry.reedy
Date 2010-08-08.22:02:22
Message-id <1281304947.33.0.573905165906.issue9520@psf.upfronthosting.co.za>
In-reply-to
Content
Yes. Data containers optimized for very large datasets, compactness and strict adherence to O(1) can be beneficial. 

Python has great high-performance containers, but there is a certain lack of compact ones. For example, on an x64 machine the following dict() mapping 10,000,000 very short unicode keys (~7 chars) to integers eats 149 bytes per entry (1,458,324 kB of virtual memory divided by 10,000,000 entries).
>>> import os, re
>>> d = dict()
>>> for i in xrange(0, 10000000): d[unicode(i)] = i
>>> print re.findall("(VmPeak.*|VmSize.*)", open('/proc/%d/status' % os.getpid()).read())
['VmPeak:\t 1458324 kB', 'VmSize:\t 1458324 kB']
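The transcript above is Python 2 and reads the process's virtual memory size from /proc, which is Linux-specific. A rough Python 3 sketch of a similar measurement is shown here, assuming the standard-library tracemalloc module is an acceptable stand-in. It counts only allocations made through Python's allocator, so its number is not directly comparable to VmPeak, and the helper name bytes_per_entry is hypothetical.

```python
import tracemalloc


def bytes_per_entry(n):
    """Approximate bytes allocated per dict entry for n short str keys.

    Hypothetical helper: measures Python-level allocations only,
    not the process's virtual memory size as in the /proc reading.
    """
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        # Same shape of data as the original test: short string keys -> ints.
        d = {str(i): i for i in range(n)}
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    assert len(d) == n  # d stays alive until after the second reading
    return (after - before) / n


if __name__ == "__main__":
    # Use a smaller n than 10,000,000 for a quick check; tracing adds overhead.
    print(round(bytes_per_entry(1_000_000)), "bytes per entry (approx.)")
```

The result will differ between Python versions and builds, so it is best read as an order-of-magnitude check rather than a reproduction of the 149-byte figure.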

I can understand that there are all kinds of reasons why it is so, and even why it is good. But having an unobtrusive *compact* container could be nice (although you'd be most welcome to tweak the default containers so that they adjust to large datasets appropriately).

Also, I can't emphasize enough that compactness is still important sometimes. Modern datasets are getting larger and larger (literally terabytes), and a 'just add more memory' strategy is not always feasible.


Regarding the dict() violation of O(1): so far I have been unable to reproduce it in a test. I can certainly see it on the real dataset, and trust me, it was very annoying to see the ETA go from 10 hours down to 8 hours, then gradually climb back up to 17 hours and hang there. This was _solved_ by switching from dict() to Bio.trie(). So this problem certainly had something to do with dict(). I don't know what is causing it, though.
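One way to look for this kind of slowdown is to time insertions in fixed-size batches and watch whether the per-batch cost grows as the dict gets larger. The following is a minimal Python 3 sketch under that assumption. The function name time_insert_batches and the synthetic keys are hypothetical, and synthetic keys may well fail to reproduce behaviour seen only on the real dataset.

```python
import time


def time_insert_batches(n_batches, batch_size):
    """Insert n_batches * batch_size short str keys, timing each batch.

    Hypothetical helper: returns the filled dict and a list with the
    wall-clock seconds spent on each batch of insertions.
    """
    d = {}
    timings = []
    for b in range(n_batches):
        base = b * batch_size
        start = time.perf_counter()
        for i in range(base, base + batch_size):
            d[str(i)] = i
        timings.append(time.perf_counter() - start)
    return d, timings


if __name__ == "__main__":
    _, timings = time_insert_batches(n_batches=20, batch_size=500_000)
    for b, seconds in enumerate(timings):
        # Roughly flat numbers suggest amortized O(1) inserts; occasional
        # spikes are expected when the dict resizes its hash table.
        print("batch %2d: %.3f s" % (b, seconds))
```

If the per-batch times climb steadily rather than spiking only at resizes, swapping in keys from the real dataset would be the next thing to try.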
History
Date                 User  Action  Args
2010-08-08 22:02:27  dmtr  set     recipients: + dmtr, rhettinger, terry.reedy, mark.dickinson, pitrou, ezio.melotti, eric.araujo, daniel.urban
2010-08-08 22:02:27  dmtr  set     messageid: <1281304947.33.0.573905165906.issue9520@psf.upfronthosting.co.za>
2010-08-08 22:02:24  dmtr  link    issue9520 messages
2010-08-08 22:02:23  dmtr  create