This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author christian.heimes
Recipients christian.heimes, ncoghlan, pitrou
Date 2013-10-07.10:57:35
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1381143456.3.0.948803349956.issue19183@psf.upfronthosting.co.za>
In-reply-to
Content
Your benchmark is a bit unrealistic because it times the hash cache most of the time. Here is a better benchmark (but bytes-only):

$ ./python -m timeit -s "words=[w.encode('utf-8') for line in open('../LICENSE') for w in line.split()]; import collections" -- "c = collections.Counter(memoryview(w) for w in words); c.most_common(10)"
1000 loops, best of 3: 1.63 msec per loop

This increases the number of hash calculations from about 28k to over 8.4 mio. 

I also added a little statistic function to see how large typical string are. The artificial benchmark: 

hash     1:    115185
hash     2:   1440956
hash     3:   1679976
hash     4:    873769
hash     5:    948124
hash     6:    651799
hash     7:    676707
hash     8:    545459
hash     9:    523615
hash    10:    421232
hash    11:    161641
hash    12:    140797
hash    13:     86826
hash    14:     41702
hash    15:     41570
hash    16:       332
hash    17:       211
hash    18:      4275
hash    19:       205
hash    20:       131
hash    21:      4197
hash    22:        70
hash    23:        35
hash    24:        44
hash    25:      4145
hash    26:      4137
hash    27:      4137
hash    28:        21
hash    29:      4124
hash    30:         8
hash    31:         5
hash    32:         1
hash other:     28866
hash total:   8404302

And here is the statistic of a full test run.

hash     1:     18935
hash     2:    596761
hash     3:    643973
hash     4:    645399
hash     5:    576231
hash     6:    742531
hash     7:    497214
hash     8:    330890
hash     9:    291301
hash    10:     93206
hash    11:   1417900
hash    12:    160802
hash    13:     58675
hash    14:     49324
hash    15:     48068
hash    16:     90634
hash    17:     24163
hash    18:     66079
hash    19:     23408
hash    20:     20695
hash    21:     16424
hash    22:     17236
hash    23:     59135
hash    24:     10368
hash    25:      6047
hash    26:      6784
hash    27:      5565
hash    28:      5931
hash    29:      3469
hash    30:      4220
hash    31:      2652
hash    32:      2911
hash other:     72042
hash total:   6608973

About 50% of the hashed elements are between 1 and 5 characters long. A realistic hash collision attack on 32bit needs at least 7, 8 chars. I see the chance of a micro-optimization! :)
History
Date User Action Args
2013-10-07 10:57:36christian.heimessetrecipients: + christian.heimes, ncoghlan, pitrou
2013-10-07 10:57:36christian.heimessetmessageid: <1381143456.3.0.948803349956.issue19183@psf.upfronthosting.co.za>
2013-10-07 10:57:36christian.heimeslinkissue19183 messages
2013-10-07 10:57:35christian.heimescreate