Message151620
I tried collision counting with a low number of allowed collisions:
less than 15 collisions
-----------------------
Python fails at startup.
5 collisions (32 buckets, 21 used=65.6%): hash=ceb3152f => f
10 collisions (32 buckets, 21 used=65.6%): hash=ceb3152f => f
dict((str(k), 0) for k in range(2000000))
-----------------------------------------
15 collisions (32,768 buckets, 18024 used=55.0%): hash=0e4631d2 => 31d2
20 collisions (131,072 buckets, 81568 used=62.2%): hash=12660719 => 719
25 collisions (1,048,576 buckets, 643992 used=61.4%): hash=6a1f6d21 => f6d21
30 collisions (1,048,576 buckets, 643992 used=61.4%): hash=6a1f6d21 => f6d21
35 collisions => ? (more than 10,000,000 integers)
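For context on what is being counted: a rough pure-Python model of how these collision counts arise. This is only a sketch, not the patched CPython; it mimics dictobject.c's open-addressing probe sequence (i = i*5 + perturb + 1; perturb >>= 5) and counts how many occupied slots an insert probes before finding a free one. The name insert_collisions is mine, not from the patch.

```python
PERTURB_SHIFT = 5

def insert_collisions(table, h):
    """Insert hash h into the simulated open-addressing table and
    return the number of occupied slots probed (the collision count)."""
    mask = len(table) - 1          # table size is a power of two
    perturb = h
    i = h & mask
    collisions = 0
    while table[i] is not None:
        collisions += 1
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + perturb + 1) & mask
    table[i] = h
    return collisions

# Example: two hashes that land in the same bucket of a 32-slot table.
table = [None] * 32
insert_collisions(table, 0x0f)      # bucket 0x0f, no collision
n = insert_collisions(table, 0x2f)  # 0x2f & 31 == 0x0f: one collision
print(n)  # 1
```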
random_dict('', 50000, charset, 1, 3)
--------------------------------------
charset = 'abcdefghijklmnopqrstuvwxyz0123456789'
15 collisions (8192 buckets, 5083 used=62.0%): hash=1526677a => 77a
20 collisions (32768 buckets, 19098 used=58.3%): hash=5d7760e6 => 60e6
25 collisions => <unable to generate a new key>
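A likely explanation for the "unable to generate a new key" failure: with this 36-character charset and key lengths 1 to 3, there are only 36 + 36^2 + 36^3 = 47,988 distinct keys, fewer than the 50,000 requested, so key generation exhausts the key space before the 25-collision limit triggers. A quick check (keyspace_size is a helper name I made up):

```python
def keyspace_size(charset, minlen, maxlen):
    """Number of distinct keys of length minlen..maxlen over charset."""
    return sum(len(charset) ** n for n in range(minlen, maxlen + 1))

charset = 'abcdefghijklmnopqrstuvwxyz0123456789'
print(keyspace_size(charset, 1, 3))  # 47988, less than the 50000 requested
```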
random_dict('', 50000, charset, 1, 3)
--------------------------------------
charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.=+_(){}%'
15 collisions (32768 buckets, 20572 used=62.8%): hash=789fe1e6 => 61e6
20 collisions (2048 buckets, 1297 used=63.3%): hash=2052533d => 33d
25 collisions => nope
random_dict('', 50000, charset, 1, 10)
--------------------------------------
charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.=+_(){}%'
15 collisions (32768 buckets, 18964 used=57.9%): hash=94d7c4f5 => 44f5
20 collisions (32768 buckets, 21548 used=65.8%): hash=acb5b39e => 339e
25 collisions (8192 buckets, 5395 used=65.9%): hash=04d367ae => 7ae
30 collisions => nope
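For readers decoding the result lines above: the value after "=>" is the bucket index, i.e. the hash truncated to the table size. Since CPython dict tables have power-of-two sizes, the index is simply hash & (num_buckets - 1); this reproduces the reported pairs:

```python
def bucket_index(h, num_buckets):
    """Bucket index in a power-of-two hash table (CPython style)."""
    return h & (num_buckets - 1)

# Reproduce some of the reported (hash, buckets) => index pairs:
print(hex(bucket_index(0xceb3152f, 32)))       # 0xf
print(hex(bucket_index(0x0e4631d2, 32768)))    # 0x31d2
print(hex(bucket_index(0x12660719, 131072)))   # 0x719
print(hex(bucket_index(0x6a1f6d21, 1048576)))  # 0xf6d21
```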
random_dict() comes from the following (Python 2) script:
***
import random

def random_string(charset, minlen, maxlen):
    strlen = random.randint(minlen, maxlen)
    return ''.join(random.choice(charset) for index in xrange(strlen))

def random_dict(prefix, count, charset, minlen, maxlen):
    dico = {}
    keys = set()
    for index in xrange(count):
        for tries in xrange(10000):
            key = prefix + random_string(charset, minlen, maxlen)
            if key in keys:
                continue
            keys.add(key)
            break
        else:
            raise ValueError("unable to generate a new key")
        dico[key] = None
    return dico

charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.=+_(){}%'
charset = 'abcdefghijklmnopqrstuvwxyz0123456789'  # overrides the line above
random_dict('', 50000, charset, 1, 3)
***
I ran the Django test suite. With a limit of 20 collisions, 60 tests
fail. With a limit of 50 collisions, there is no failure. But I don't
think that the test suite uses large data sets.
I also tried the Django test suite with a randomized hash function.
There are 46 failures. Many (all?) are related to the order of dict
keys: repr(dict) or indirectly in HTML output. I didn't analyze all
the failures. I suppose that Django can simply run the test suite with
PYTHONHASHSEED=0 (disabling the randomized hash function), at least
initially.
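With a fixed seed, string hashes are repeatable across interpreter runs, which keeps dict ordering stable for tests. A minimal illustration (assumes a python3 with hash randomization; the Django invocation is only a hypothetical example, not the actual command):

```shell
# Same seed => same str hash across interpreter runs:
PYTHONHASHSEED=0 python3 -c 'print(hash("abc"))'
PYTHONHASHSEED=0 python3 -c 'print(hash("abc"))'   # identical output

# Illustrative only: run a test suite with randomization disabled.
# PYTHONHASHSEED=0 python3 runtests.py
```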
Date                | User     | Action | Args
2012-01-19 13:13:43 | vstinner | set    | recipients: + vstinner, lemburg, gvanrossum, tim.peters, barry, georg.brandl, terry.reedy, gregory.p.smith, jcea, mark.dickinson, pitrou, christian.heimes, benjamin.peterson, eric.araujo, grahamd, Arfrever, v+python, alex, zbysz, skrah, dmalcolm, gz, neologix, Arach, Mark.Shannon, eric.snow, Zhiping.Deng, Huzaifa.Sidhpurwala, Jim.Jewett, PaulMcMillan, fx5
2012-01-19 13:13:42 | vstinner | link   | issue13703 messages
2012-01-19 13:13:41 | vstinner | create |