This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author vstinner
Recipients Arach, Arfrever, Huzaifa.Sidhpurwala, Jim.Jewett, Mark.Shannon, PaulMcMillan, Zhiping.Deng, alex, barry, benjamin.peterson, christian.heimes, dmalcolm, eric.araujo, eric.snow, fx5, georg.brandl, grahamd, gregory.p.smith, gvanrossum, gz, jcea, lemburg, mark.dickinson, neologix, pitrou, skrah, terry.reedy, tim.peters, v+python, vstinner, zbysz
Date 2012-01-19.13:13:41
SpamBayes Score 7.903199e-06
Marked as misclassified No
Message-id <CAMpsgwbsVX0g9o40EvjXbXt2FBugdd35AkH7zFOPVf99VNqB8A@mail.gmail.com>
In-reply-to <4F176E88.3090503@udel.edu>
Content
I tried the collision counting with a low number of collisions:

less than 15 collisions
-----------------------

Fail at startup.

5 collisions (32 buckets, 21 used=65.6%): hash=ceb3152f => f
10 collisions (32 buckets, 21 used=65.6%): hash=ceb3152f => f

dict((str(k), 0) for k in range(2000000))
-----------------------------------------

15 collisions (32,768 buckets, 18024 used=55.0%): hash=0e4631d2 => 31d2
20 collisions (131,072 buckets, 81568 used=62.2%): hash=12660719 => 719
25 collisions (1,048,576 buckets, 643992 used=61.4%): hash=6a1f6d21 => f6d21
30 collisions (1,048,576 buckets, 643992 used=61.4%): hash=6a1f6d21 => f6d21
35 collisions => ? (more than 10,000,000 integers)

random_dict('', 50000, charset, 1, 3)
--------------------------------------

charset = 'abcdefghijklmnopqrstuvwxyz0123456789'

15 collisions (8192 buckets, 5083 used=62.0%): hash=1526677a => 77a
20 collisions (32768 buckets, 19098 used=58.3%): hash=5d7760e6 => 60e6
25 collisions => <unable to generate a new key>

random_dict('', 50000, charset, 1, 3)
--------------------------------------

charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.=+_(){}%'

15 collisions (32768 buckets, 20572 used=62.8%): hash=789fe1e6 => 61e6
20 collisions (2048 buckets, 1297 used=63.3%): hash=2052533d => 33d
25 collisions => nope

random_dict('', 50000, charset, 1, 10)
--------------------------------------

charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.=+_(){}%'

15 collisions (32768 buckets, 18964 used=57.9%): hash=94d7c4f5 => 44f5
20 collisions (32768 buckets, 21548 used=65.8%): hash=acb5b39e => 339e
25 collisions (8192 buckets, 5395 used=65.9%): hash=04d367ae => 7ae
30 collisions => nope

random_dict() comes from the following script:
***
import random

def random_string(charset, minlen, maxlen):
    strlen = random.randint(minlen, maxlen)
    return ''.join(random.choice(charset) for index in xrange(strlen))

def random_dict(prefix, count, charset, minlen, maxlen):
    dico = {}
    keys = set()
    for index in xrange(count):
        for tries in xrange(10000):
            key = prefix + random_string(charset, minlen, maxlen)
            if key in keys:
                continue
            keys.add(key)
            break
        else:
            raise ValueError("unable to generate a new key")
        dico[key] = None

charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.=+_(){}%'
charset = 'abcdefghijklmnopqrstuvwxyz0123456789'
random_dict('', 50000, charset, 1, 3)
***

I ran the Django test suite. With a limit of 20 collisions, 60 tests
fail. With a limit of 50 collisions, there is no failure. But I don't
think that the test suite uses large data sets.

I also triend the Django test suite with a randomized hash function.
There are 46 failures. Many (all?) are related to the order of dict
keys: repr(dict) or indirectly in a HTML output. I didn't analyze all
failures. I suppose that Django can simply run the test suite using
PYTHONHASHSEED=0 (disable the randomized hash function), at least in a
first time.
History
Date User Action Args
2012-01-19 13:13:43vstinnersetrecipients: + vstinner, lemburg, gvanrossum, tim.peters, barry, georg.brandl, terry.reedy, gregory.p.smith, jcea, mark.dickinson, pitrou, christian.heimes, benjamin.peterson, eric.araujo, grahamd, Arfrever, v+python, alex, zbysz, skrah, dmalcolm, gz, neologix, Arach, Mark.Shannon, eric.snow, Zhiping.Deng, Huzaifa.Sidhpurwala, Jim.Jewett, PaulMcMillan, fx5
2012-01-19 13:13:42vstinnerlinkissue13703 messages
2012-01-19 13:13:41vstinnercreate