Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ipaddress: hash similarities for ipv4 and ipv6 #64645

Closed
flambda mannequin opened this issue Jan 30, 2014 · 9 comments
Closed

ipaddress: hash similarities for ipv4 and ipv6 #64645

flambda mannequin opened this issue Jan 30, 2014 · 9 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error

Comments

@flambda
Copy link
Mannequin

flambda mannequin commented Jan 30, 2014

BPO 20446
Nosy @tim-one, @ncoghlan, @pitrou, @socketpair, @MojoVampire

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2014-06-21.01:31:14.838>
created_at = <Date 2014-01-30.13:02:08.064>
labels = ['interpreter-core', 'type-bug', 'invalid']
title = 'ipaddress: hash similarities for ipv4 and ipv6'
updated_at = <Date 2016-07-08.11:03:53.525>
user = 'https://bugs.python.org/flambda'

bugs.python.org fields:

activity = <Date 2016-07-08.11:03:53.525>
actor = 'socketpair'
assignee = 'none'
closed = True
closed_date = <Date 2014-06-21.01:31:14.838>
closer = 'tim.peters'
components = ['Interpreter Core']
creation = <Date 2014-01-30.13:02:08.064>
creator = 'flambda'
dependencies = []
files = []
hgrepos = []
issue_num = 20446
keywords = []
message_count = 9.0
messages = ['209714', '221045', '221060', '221139', '221140', '221141', '221149', '269983', '269985']
nosy_count = 8.0
nosy_names = ['tim.peters', 'ncoghlan', 'pitrou', 'pmoody', 'BreamoreBoy', 'socketpair', 'flambda', 'josh.r']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue20446'
versions = ['Python 3.3']

@flambda
Copy link
Mannequin Author

flambda mannequin commented Jan 30, 2014

Hi,

I am a bit unsure if this is a problem but bear with me:

Using ipaddress I noticed that:
hash(ipaddress.ip_address("0.0.0.1")) == hash(ipaddress.ip_address("::1"))

which makes sense as ipaddress uses this:
(http://hg.python.org/cpython/file/default/Lib/ipaddress.py)
def __hash__(self):
return hash(hex(int(self._ip)))
as the hash.

This means that in a mixed environment (ipv4 and ipv6 addresses), hashtables with hash(hex(int(b))), int(b)<2**32 will be filled more then tables with int(b)>2**32 due to the ipv4 and ipv6 address space overlapping in the first 32bits.

If this is intended behaviour, I am sorry to have bothered you.
Have a nice evening.

@flambda flambda mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels Jan 30, 2014
@BreamoreBoy
Copy link
Mannequin

BreamoreBoy mannequin commented Jun 19, 2014

Can someone comment on this issue please.

@pitrou
Copy link
Member

pitrou commented Jun 20, 2014

Unless you can show up a real-world workload where this truely impacts performance, I'd classify this is a rather vague concern.

@MojoVampire
Copy link
Mannequin

MojoVampire mannequin commented Jun 21, 2014

Correct me if I'm wrong, but wouldn't this only become a concern if:

  1. You're storing both IPv4 and IPv6 addresses side-by-side
  2. You're storing well over a billion IP addresses
  3. Hash codes for the hex string of an IP address were predictably sequential (they're not)

On point #3 alone, you can check for yourself. In a quick test within a single process on Python 3.4, hash(hex(0x1)) == 7060637827927985012 while hash(hex(0x2)) == -4275917786525356978 (your numbers may vary thanks to per process string hash seeding, but they should be quite random). As such, you couldn't easily fill more than two sequential buckets reliably; you could guarantee collision chaining occurs at least once (since as you noted, you can create two IP addresses with the same hash reliably), but the chains will be evenly distributed; you can't build on that to get a second, third, ..., nth collision.

There wouldn't be a meaningful "imbalance" between low and high IP addresses either; below a billion or so IP addresses, random chance would dictate the occasional hash code would collide, and you could guarantee that collisions with the sub-32 bit values would collide one extra time before finding an empty bucket, but I seem to recall a typical dict insertion involves 1-5 collisions already; adding one extra to every single dict insertion/lookup costs something, but it's not that much, and the scenarios required to take advantage of it would be incredibly contrived.

@tim-one
Copy link
Member

tim-one commented Jun 21, 2014

I'm more puzzled by why __hash__() here bothers to call hex() at all. It's faster to hash the underlying int as-is, and collisions in this specific context would still be rare.

@josh, note that there's nothing bad about getting sequential hash codes in CPython's implementation of dicts. This is explained in dictobject.c, in the long comment block starting with "Major subtleties ahead:".

In any case, I'm closing this, as nobody has brought up a real problem.

@pitrou
Copy link
Member

pitrou commented Jun 21, 2014

Tim Peters added the comment:

I'm more puzzled by why __hash__() here bothers to call hex() at
all. It's faster to hash the underlying int as-is, and collisions in
this specific context would still be rare.

Let someone provide a patch, then.

@MojoVampire
Copy link
Mannequin

MojoVampire mannequin commented Jun 21, 2014

@tim: Sorry, forgot the internals there (I read all that a while ago, but it's not easy to keep all the details in mind). Apparently there are four reasons this isn't actually a problem. :-)

I agree it's kind of silly that it's converted to a hex string before hashing. Maybe the person writing the code also forgot about sequential hashes being harmless and thought using string hashes (that vary more erratically in modern SIP hash Python) over int hashes would reduce collisions?

@socketpair
Copy link
Mannequin

socketpair mannequin commented Jul 8, 2016

See also bpo-20446

@socketpair
Copy link
Mannequin

socketpair mannequin commented Jul 8, 2016

Sorry. See also bpo-27269

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

2 participants