Author vstinner Arach, Arfrever, Huzaifa.Sidhpurwala, Jim.Jewett, Mark.Shannon, PaulMcMillan, Zhiping.Deng, alex, barry, benjamin.peterson, christian.heimes, dmalcolm, eric.araujo, eric.snow, fx5, georg.brandl, grahamd, gregory.p.smith, gvanrossum, gz, jcea, lemburg, mark.dickinson, neologix, pitrou, skrah, terry.reedy, tim.peters, v+python, vstinner, zbysz 2012-01-19.13:13:41 7.9032e-06 No <4F176E88.3090503@udel.edu>
Content
```I tried the collision counting with a low number of collisions:

less than 15 collisions
-----------------------

Fail at startup.

5 collisions (32 buckets, 21 used=65.6%): hash=ceb3152f => f
10 collisions (32 buckets, 21 used=65.6%): hash=ceb3152f => f

dict((str(k), 0) for k in range(2000000))
-----------------------------------------

15 collisions (32,768 buckets, 18024 used=55.0%): hash=0e4631d2 => 31d2
20 collisions (131,072 buckets, 81568 used=62.2%): hash=12660719 => 719
25 collisions (1,048,576 buckets, 643992 used=61.4%): hash=6a1f6d21 => f6d21
30 collisions (1,048,576 buckets, 643992 used=61.4%): hash=6a1f6d21 => f6d21
35 collisions => ? (more than 10,000,000 integers)

random_dict('', 50000, charset, 1, 3)
--------------------------------------

charset = 'abcdefghijklmnopqrstuvwxyz0123456789'

15 collisions (8192 buckets, 5083 used=62.0%): hash=1526677a => 77a
20 collisions (32768 buckets, 19098 used=58.3%): hash=5d7760e6 => 60e6
25 collisions => <unable to generate a new key>

random_dict('', 50000, charset, 1, 3)
--------------------------------------

charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.=+_(){}%'

15 collisions (32768 buckets, 20572 used=62.8%): hash=789fe1e6 => 61e6
20 collisions (2048 buckets, 1297 used=63.3%): hash=2052533d => 33d
25 collisions => nope

random_dict('', 50000, charset, 1, 10)
--------------------------------------

charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.=+_(){}%'

15 collisions (32768 buckets, 18964 used=57.9%): hash=94d7c4f5 => 44f5
20 collisions (32768 buckets, 21548 used=65.8%): hash=acb5b39e => 339e
25 collisions (8192 buckets, 5395 used=65.9%): hash=04d367ae => 7ae
30 collisions => nope

random_dict() comes from the following script:
***
import random

def random_string(charset, minlen, maxlen):
strlen = random.randint(minlen, maxlen)
return ''.join(random.choice(charset) for index in xrange(strlen))

def random_dict(prefix, count, charset, minlen, maxlen):
dico = {}
keys = set()
for index in xrange(count):
for tries in xrange(10000):
key = prefix + random_string(charset, minlen, maxlen)
if key in keys:
continue
break
else:
raise ValueError("unable to generate a new key")
dico[key] = None

charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.=+_(){}%'
charset = 'abcdefghijklmnopqrstuvwxyz0123456789'
random_dict('', 50000, charset, 1, 3)
***

I ran the Django test suite. With a limit of 20 collisions, 60 tests
fail. With a limit of 50 collisions, there is no failure. But I don't
think that the test suite uses large data sets.

I also triend the Django test suite with a randomized hash function.
There are 46 failures. Many (all?) are related to the order of dict
keys: repr(dict) or indirectly in a HTML output. I didn't analyze all
failures. I suppose that Django can simply run the test suite using
PYTHONHASHSEED=0 (disable the randomized hash function), at least in a
first time.```
History
Date User Action Args
2012-01-19 13:13:43vstinnersetrecipients: + vstinner, lemburg, gvanrossum, tim.peters, barry, georg.brandl, terry.reedy, gregory.p.smith, jcea, mark.dickinson, pitrou, christian.heimes, benjamin.peterson, eric.araujo, grahamd, Arfrever, v+python, alex, zbysz, skrah, dmalcolm, gz, neologix, Arach, Mark.Shannon, eric.snow, Zhiping.Deng, Huzaifa.Sidhpurwala, Jim.Jewett, PaulMcMillan, fx5