Message199137
Your benchmark is a bit unrealistic because it times the hash cache most of the time. Here is a better benchmark (but bytes-only):
$ ./python -m timeit -s "words=[w.encode('utf-8') for line in open('../LICENSE') for w in line.split()]; import collections" -- "c = collections.Counter(memoryview(w) for w in words); c.most_common(10)"
1000 loops, best of 3: 1.63 msec per loop
This increases the number of hash calculations from about 28k to over 8.4 mio.
I also added a little statistic function to see how large typical string are. The artificial benchmark:
hash 1: 115185
hash 2: 1440956
hash 3: 1679976
hash 4: 873769
hash 5: 948124
hash 6: 651799
hash 7: 676707
hash 8: 545459
hash 9: 523615
hash 10: 421232
hash 11: 161641
hash 12: 140797
hash 13: 86826
hash 14: 41702
hash 15: 41570
hash 16: 332
hash 17: 211
hash 18: 4275
hash 19: 205
hash 20: 131
hash 21: 4197
hash 22: 70
hash 23: 35
hash 24: 44
hash 25: 4145
hash 26: 4137
hash 27: 4137
hash 28: 21
hash 29: 4124
hash 30: 8
hash 31: 5
hash 32: 1
hash other: 28866
hash total: 8404302
And here is the statistic of a full test run.
hash 1: 18935
hash 2: 596761
hash 3: 643973
hash 4: 645399
hash 5: 576231
hash 6: 742531
hash 7: 497214
hash 8: 330890
hash 9: 291301
hash 10: 93206
hash 11: 1417900
hash 12: 160802
hash 13: 58675
hash 14: 49324
hash 15: 48068
hash 16: 90634
hash 17: 24163
hash 18: 66079
hash 19: 23408
hash 20: 20695
hash 21: 16424
hash 22: 17236
hash 23: 59135
hash 24: 10368
hash 25: 6047
hash 26: 6784
hash 27: 5565
hash 28: 5931
hash 29: 3469
hash 30: 4220
hash 31: 2652
hash 32: 2911
hash other: 72042
hash total: 6608973
About 50% of the hashed elements are between 1 and 5 characters long. A realistic hash collision attack on 32bit needs at least 7, 8 chars. I see the chance of a micro-optimization! :) |
|
Date |
User |
Action |
Args |
2013-10-07 10:57:36 | christian.heimes | set | recipients:
+ christian.heimes, ncoghlan, pitrou |
2013-10-07 10:57:36 | christian.heimes | set | messageid: <1381143456.3.0.948803349956.issue19183@psf.upfronthosting.co.za> |
2013-10-07 10:57:36 | christian.heimes | link | issue19183 messages |
2013-10-07 10:57:35 | christian.heimes | create | |
|