Title: ipaddress: hash similarities for ipv4 and ipv6
Type: behavior Stage: resolved
Components: Interpreter Core Versions: Python 3.3
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: BreamoreBoy, flambda, josh.r, ncoghlan, pitrou, pmoody, socketpair, tim.peters
Priority: normal Keywords:

Created on 2014-01-30 13:02 by flambda, last changed 2016-07-08 11:03 by socketpair. This issue is now closed.

Messages (9)
msg209714 - (view) Author: flambda (flambda) Date: 2014-01-30 13:02

I am a bit unsure if this is a problem but bear with me:

Using ipaddress I noticed that: 
hash(ipaddress.ip_address("")) == hash(ipaddress.ip_address("::1"))

which makes sense as ipaddress uses this:
    def __hash__(self):
        return hash(hex(int(self._ip)))
as the hash.

This means that in a mixed environment (ipv4 and ipv6 addresses), hashtables with hash(hex(int(b))), int(b)<2**32 will be filled more then tables with int(b)>2**32 due to the ipv4 and ipv6 address space overlapping in the first 32bits.

If this is intended behaviour, I am sorry to have bothered you.
Have a nice evening.
msg221045 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-06-19 22:36
Can someone comment on this issue please.
msg221060 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-06-20 01:51
Unless you can show up a real-world workload where this truely impacts performance, I'd classify this is a rather vague concern.
msg221139 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2014-06-21 01:03
Correct me if I'm wrong, but wouldn't this only become a concern if:

1. You're storing both IPv4 and IPv6 addresses side-by-side
2. You're storing well over a billion IP addresses
3. Hash codes for the hex string of an IP address were predictably sequential (they're not)

On point #3 alone, you can check for yourself. In a quick test within a single process on Python 3.4, hash(hex(0x1)) == 7060637827927985012 while hash(hex(0x2)) == -4275917786525356978 (your numbers may vary thanks to per process string hash seeding, but they should be quite random). As such, you couldn't easily fill more than two sequential buckets reliably; you could guarantee collision chaining occurs at least once (since as you noted, you can create two IP addresses with the same hash reliably), but the chains will be evenly distributed; you can't build on that to get a second, third, ..., nth collision.

There wouldn't be a meaningful "imbalance" between low and high IP addresses either; below a billion or so IP addresses, random chance would dictate the occasional hash code would collide, and you could guarantee that collisions with the sub-32 bit values would collide one extra time before finding an empty bucket, but I seem to recall a typical dict insertion involves 1-5 collisions already; adding one extra to every single dict insertion/lookup costs something, but it's not that much, and the scenarios required to take advantage of it would be incredibly contrived.
msg221140 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-06-21 01:31
I'm more puzzled by why `__hash__()` here bothers to call `hex()` at all.  It's faster to hash the underlying int as-is, and collisions in this specific context would still be rare.

@Josh, note that there's nothing bad about getting sequential hash codes in CPython's implementation of dicts.  This is explained in dictobject.c, in the long comment block starting with "Major subtleties ahead:".

In any case, I'm closing this, as nobody has brought up a real problem.
msg221141 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-06-21 01:33
> Tim Peters added the comment:
> I'm more puzzled by why `__hash__()` here bothers to call `hex()` at
all. It's faster to hash the underlying int as-is, and collisions in
this specific context would still be rare.

Let someone provide a patch, then.
msg221149 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2014-06-21 03:24
@Tim: Sorry, forgot the internals there (I read all that a while ago, but it's not easy to keep all the details in mind). Apparently there are four reasons this isn't actually a problem. :-)

I agree it's kind of silly that it's converted to a hex string before hashing. Maybe the person writing the code also forgot about sequential hashes being harmless and thought using string hashes (that vary more erratically in modern SIP hash Python) over int hashes would reduce collisions?
msg269983 - (view) Author: Марк Коренберг (socketpair) * Date: 2016-07-08 11:03
See also issue20446
msg269985 - (view) Author: Марк Коренберг (socketpair) * Date: 2016-07-08 11:03
Sorry. See also issue27269
Date User Action Args
2016-07-08 11:03:53socketpairsetmessages: + msg269985
2016-07-08 11:03:23socketpairsetnosy: + socketpair
messages: + msg269983
2014-06-21 03:24:20josh.rsetmessages: + msg221149
2014-06-21 01:33:02pitrousetmessages: + msg221141
2014-06-21 01:31:14tim.peterssetstatus: open -> closed

nosy: + tim.peters
messages: + msg221140

resolution: not a bug
stage: resolved
2014-06-21 01:03:57josh.rsetnosy: + josh.r
messages: + msg221139
2014-06-20 01:51:43pitrousetnosy: + pitrou
messages: + msg221060
2014-06-19 23:02:26ned.deilysetnosy: + ncoghlan, pmoody
2014-06-19 22:36:47BreamoreBoysetnosy: + BreamoreBoy
messages: + msg221045
2014-01-30 13:02:08flambdacreate