ipaddress: hash similarities for ipv4 and ipv6 #64645

flambda · 2014-01-30T13:02:08Z

BPO	20446
Nosy	@tim-one, @ncoghlan, @pitrou, @socketpair, @MojoVampire

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2014-06-21.01:31:14.838>
created_at = <Date 2014-01-30.13:02:08.064>
labels = ['interpreter-core', 'type-bug', 'invalid']
title = 'ipaddress: hash similarities for ipv4 and ipv6'
updated_at = <Date 2016-07-08.11:03:53.525>
user = 'https://bugs.python.org/flambda'

bugs.python.org fields:

activity = <Date 2016-07-08.11:03:53.525>
actor = 'socketpair'
assignee = 'none'
closed = True
closed_date = <Date 2014-06-21.01:31:14.838>
closer = 'tim.peters'
components = ['Interpreter Core']
creation = <Date 2014-01-30.13:02:08.064>
creator = 'flambda'
dependencies = []
files = []
hgrepos = []
issue_num = 20446
keywords = []
message_count = 9.0
messages = ['209714', '221045', '221060', '221139', '221140', '221141', '221149', '269983', '269985']
nosy_count = 8.0
nosy_names = ['tim.peters', 'ncoghlan', 'pitrou', 'pmoody', 'BreamoreBoy', 'socketpair', 'flambda', 'josh.r']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue20446'
versions = ['Python 3.3']

flambda · 2014-01-30T13:02:07Z

Hi,

I am a bit unsure if this is a problem but bear with me:

Using ipaddress I noticed that:
hash(ipaddress.ip_address("0.0.0.1")) == hash(ipaddress.ip_address("::1"))

which makes sense as ipaddress uses this:
(http://hg.python.org/cpython/file/default/Lib/ipaddress.py)
def __hash__(self):
return hash(hex(int(self._ip)))
as the hash.

This means that in a mixed environment (ipv4 and ipv6 addresses), hashtables with hash(hex(int(b))), int(b)<2**32 will be filled more then tables with int(b)>2**32 due to the ipv4 and ipv6 address space overlapping in the first 32bits.

If this is intended behaviour, I am sorry to have bothered you.
Have a nice evening.

BreamoreBoy · 2014-06-19T22:36:48Z

Can someone comment on this issue please.

pitrou · 2014-06-20T01:51:44Z

Unless you can show up a real-world workload where this truely impacts performance, I'd classify this is a rather vague concern.

MojoVampire · 2014-06-21T01:03:56Z

Correct me if I'm wrong, but wouldn't this only become a concern if:

You're storing both IPv4 and IPv6 addresses side-by-side
You're storing well over a billion IP addresses
Hash codes for the hex string of an IP address were predictably sequential (they're not)

On point #3 alone, you can check for yourself. In a quick test within a single process on Python 3.4, hash(hex(0x1)) == 7060637827927985012 while hash(hex(0x2)) == -4275917786525356978 (your numbers may vary thanks to per process string hash seeding, but they should be quite random). As such, you couldn't easily fill more than two sequential buckets reliably; you could guarantee collision chaining occurs at least once (since as you noted, you can create two IP addresses with the same hash reliably), but the chains will be evenly distributed; you can't build on that to get a second, third, ..., nth collision.

There wouldn't be a meaningful "imbalance" between low and high IP addresses either; below a billion or so IP addresses, random chance would dictate the occasional hash code would collide, and you could guarantee that collisions with the sub-32 bit values would collide one extra time before finding an empty bucket, but I seem to recall a typical dict insertion involves 1-5 collisions already; adding one extra to every single dict insertion/lookup costs something, but it's not that much, and the scenarios required to take advantage of it would be incredibly contrived.

tim-one · 2014-06-21T01:31:14Z

I'm more puzzled by why __hash__() here bothers to call hex() at all. It's faster to hash the underlying int as-is, and collisions in this specific context would still be rare.

@josh, note that there's nothing bad about getting sequential hash codes in CPython's implementation of dicts. This is explained in dictobject.c, in the long comment block starting with "Major subtleties ahead:".

In any case, I'm closing this, as nobody has brought up a real problem.

pitrou · 2014-06-21T01:33:03Z

Tim Peters added the comment:

I'm more puzzled by why __hash__() here bothers to call hex() at
all. It's faster to hash the underlying int as-is, and collisions in
this specific context would still be rare.

Let someone provide a patch, then.

MojoVampire · 2014-06-21T03:24:20Z

@tim: Sorry, forgot the internals there (I read all that a while ago, but it's not easy to keep all the details in mind). Apparently there are four reasons this isn't actually a problem. :-)

I agree it's kind of silly that it's converted to a hex string before hashing. Maybe the person writing the code also forgot about sequential hashes being harmless and thought using string hashes (that vary more erratically in modern SIP hash Python) over int hashes would reduce collisions?

socketpair · 2016-07-08T11:03:23Z

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ipaddress: hash similarities for ipv4 and ipv6 #64645

ipaddress: hash similarities for ipv4 and ipv6 #64645

flambda mannequin commented Jan 30, 2014

flambda mannequin commented Jan 30, 2014

BreamoreBoy mannequin commented Jun 19, 2014

pitrou commented Jun 20, 2014

MojoVampire mannequin commented Jun 21, 2014

tim-one commented Jun 21, 2014

pitrou commented Jun 21, 2014

MojoVampire mannequin commented Jun 21, 2014

socketpair mannequin commented Jul 8, 2016

socketpair mannequin commented Jul 8, 2016

ipaddress: hash similarities for ipv4 and ipv6 #64645

ipaddress: hash similarities for ipv4 and ipv6 #64645

Comments

flambda mannequin commented Jan 30, 2014

flambda mannequin commented Jan 30, 2014

BreamoreBoy mannequin commented Jun 19, 2014

pitrou commented Jun 20, 2014

MojoVampire mannequin commented Jun 21, 2014

tim-one commented Jun 21, 2014

pitrou commented Jun 21, 2014

MojoVampire mannequin commented Jun 21, 2014

socketpair mannequin commented Jul 8, 2016

socketpair mannequin commented Jul 8, 2016