Issue 20446: ipaddress: hash similarities for ipv4 and ipv6

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/64645

classification

Title:	ipaddress: hash similarities for ipv4 and ipv6
Type:	behavior	Stage:	resolved
Components:	Interpreter Core	Versions:	Python 3.3

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	BreamoreBoy, flambda, josh.r, ncoghlan, pitrou, pmoody, socketpair, tim.peters
Priority:	normal	Keywords:

Created on 2014-01-30 13:02 by flambda, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (9)
msg209714 - (view)	Author: flambda (flambda)	Date: 2014-01-30 13:02
Hi, I am a bit unsure if this is a problem but bear with me: Using ipaddress I noticed that: hash(ipaddress.ip_address("0.0.0.1")) == hash(ipaddress.ip_address("::1")) which makes sense as ipaddress uses this: (http://hg.python.org/cpython/file/default/Lib/ipaddress.py) def __hash__(self): return hash(hex(int(self._ip))) as the hash. This means that in a mixed environment (ipv4 and ipv6 addresses), hashtables with hash(hex(int(b))), int(b)<232 will be filled more then tables with int(b)>232 due to the ipv4 and ipv6 address space overlapping in the first 32bits. If this is intended behaviour, I am sorry to have bothered you. Have a nice evening.
msg221045 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2014-06-19 22:36
Can someone comment on this issue please.
msg221060 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-06-20 01:51
Unless you can show up a real-world workload where this truely impacts performance, I'd classify this is a rather vague concern.
msg221139 - (view)	Author: Josh Rosenberg (josh.r) *	Date: 2014-06-21 01:03
Correct me if I'm wrong, but wouldn't this only become a concern if: 1. You're storing both IPv4 and IPv6 addresses side-by-side 2. You're storing well over a billion IP addresses 3. Hash codes for the hex string of an IP address were predictably sequential (they're not) On point #3 alone, you can check for yourself. In a quick test within a single process on Python 3.4, hash(hex(0x1)) == 7060637827927985012 while hash(hex(0x2)) == -4275917786525356978 (your numbers may vary thanks to per process string hash seeding, but they should be quite random). As such, you couldn't easily fill more than two sequential buckets reliably; you could guarantee collision chaining occurs at least once (since as you noted, you can create two IP addresses with the same hash reliably), but the chains will be evenly distributed; you can't build on that to get a second, third, ..., nth collision. There wouldn't be a meaningful "imbalance" between low and high IP addresses either; below a billion or so IP addresses, random chance would dictate the occasional hash code would collide, and you could guarantee that collisions with the sub-32 bit values would collide one extra time before finding an empty bucket, but I seem to recall a typical dict insertion involves 1-5 collisions already; adding one extra to every single dict insertion/lookup costs something, but it's not that much, and the scenarios required to take advantage of it would be incredibly contrived.
msg221140 - (view)	Author: Tim Peters (tim.peters) *	Date: 2014-06-21 01:31
I'm more puzzled by why `__hash__()` here bothers to call `hex()` at all. It's faster to hash the underlying int as-is, and collisions in this specific context would still be rare. @Josh, note that there's nothing bad about getting sequential hash codes in CPython's implementation of dicts. This is explained in dictobject.c, in the long comment block starting with "Major subtleties ahead:". In any case, I'm closing this, as nobody has brought up a real problem.
msg221141 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-06-21 01:33
> Tim Peters added the comment: > > I'm more puzzled by why `__hash__()` here bothers to call `hex()` at all. It's faster to hash the underlying int as-is, and collisions in this specific context would still be rare. Let someone provide a patch, then.
msg221149 - (view)	Author: Josh Rosenberg (josh.r) *	Date: 2014-06-21 03:24
@Tim: Sorry, forgot the internals there (I read all that a while ago, but it's not easy to keep all the details in mind). Apparently there are four reasons this isn't actually a problem. :-) I agree it's kind of silly that it's converted to a hex string before hashing. Maybe the person writing the code also forgot about sequential hashes being harmless and thought using string hashes (that vary more erratically in modern SIP hash Python) over int hashes would reduce collisions?
msg269983 - (view)	Author: Марк Коренберг (socketpair) *	Date: 2016-07-08 11:03
See also issue20446
msg269985 - (view)	Author: Марк Коренберг (socketpair) *	Date: 2016-07-08 11:03
Sorry. See also issue27269

History
Date	User	Action	Args
2022-04-11 14:57:57	admin	set	github: 64645
2016-07-08 11:03:53	socketpair	set	messages: + msg269985
2016-07-08 11:03:23	socketpair	set	nosy: + socketpair messages: + msg269983
2014-06-21 03:24:20	josh.r	set	messages: + msg221149
2014-06-21 01:33:02	pitrou	set	messages: + msg221141
2014-06-21 01:31:14	tim.peters	set	status: open -> closed nosy: + tim.peters messages: + msg221140 resolution: not a bug stage: resolved
2014-06-21 01:03:57	josh.r	set	nosy: + josh.r messages: + msg221139
2014-06-20 01:51:43	pitrou	set	nosy: + pitrou messages: + msg221060
2014-06-19 23:02:26	ned.deily	set	nosy: + ncoghlan, pmoody
2014-06-19 22:36:47	BreamoreBoy	set	nosy: + BreamoreBoy messages: + msg221045
2014-01-30 13:02:08	flambda	create