Author jdemeyer
Recipients eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Date 2018-09-28.15:38:52
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1538149133.04.0.545547206417.issue34751@psf.upfronthosting.co.za>
In-reply-to
Content
I spent about 2 days doing an extensive study of the FNV and DJB algorithms. I share my conclusions below.

To be very clear what I mean, I am talking about the following algorithms (t is a tuple and m is the multiplier which is always assumed to be odd).

def FNV(t, m):
    h = 1
    for x in t:
        h = ((h * m) ^ x) % 2**32
    return h

def DJB(t, m):
    h = 1
    for x in t:
        h = ((h * m) + x) % 2**32
    return h

I am restricting to 32 bits because we need to support that anyway and because it's computationally easier than 64 bits.

I took the point of view of fixing two different tuples t and u and then seeing for which multipliers m a collision hash(t, m) == hash(u, m) occurs. Since the lower k bits of the hash depend only on the lower k bits of the multiplier, it is actually feasible to list all m's for which there is a collision; I wrote a C program to do that. Note that there are 2**31 possible odd multipliers and 2**32 different possible hash values: so you expect about 0.5 collisions on average.

For the entries of the tuples, I took random integers in the range from 0 to 63 (so the issue with negative numbers in FNV does not occur). I mostly considered tuples of length 4, 5 and 6.

It turns out that both algorithms have pairs (t, u) for which a massive number of collisions occur:

- For FNV, the tuples (12, 50, 52, 24, 3), (28, 18, 52, 56, 19) have 2**26 collisions (all multipliers with lower byte 0x01, 0x7f, 0x81 or 0xff give collisions)
- For DJB, the tuples (22, 10, 12, 22, 29), (23, 14, 18, 26, 30) have 2**24 collisions (all multipliers with lower byte 0xff give collisions)

However, when you then look at the multipliers for these bad cases, they all have a special structure in the lower bits of which there are 2 cases:

- A special bit pattern like 0x001, 0x003, 0x555, 0xccd, ...
- A zero of a low-degree polynomial like 0x9fa5 which is a zero of x^2 + x + 2 modulo 2**16 (this may only be relevant for DJB)

In order to eliminate these 2 cases, I decided to fix the lower byte of the multiplier. I wanted it to be 3 or 5 mod 8 to have maximal multiplicative order and then I checked which byte had the worst algebraic relations and no obvious bit pattern. I decided to take 0xc5 as lower byte.

When considering only multipliers with 0xc5 as lower byte, I found no very bad cases. I checked lots of pairs of tuples of length <= 6 and entries <= 64 and the worst I could find was 8192 collisions for FNV and 1024 collisions for DJB. This maximum number was found for tuples of length 5. Also, the maximum number of collisions seems bounded as the bitsize increases. For a 25-bit hash, I got the same maximum numbers of collisions as for a 32-bit hash. So, the probability of getting a collision decreases by a factor of 2 for every bit added, which is what you want.

When testing various bounds on the entries and tuple lengths, DJB is generally a little bit better than FNV (by 2 metrics: the maximum number of collisions and the root-mean-square of the number of collisions).

So my conclusion would be to use DJB as core algorithm, taking care to pick a good multiplier (or at least not an obviously bad multiplier).
History
Date User Action Args
2018-09-28 15:38:53jdemeyersetrecipients: + jdemeyer, tim.peters, rhettinger, mark.dickinson, eric.smith, sir-sigurd
2018-09-28 15:38:53jdemeyersetmessageid: <1538149133.04.0.545547206417.issue34751@psf.upfronthosting.co.za>
2018-09-28 15:38:53jdemeyerlinkissue34751 messages
2018-09-28 15:38:52jdemeyercreate