classification
Title: hash collision in instances of ipaddress.ip_network
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.4, Python 3.5, Python 2.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: Francois Schneider, mark.dickinson, serhiy.storchaka
Priority: normal Keywords:

Created on 2018-06-06 15:29 by Francois Schneider, last changed 2018-06-08 07:49 by mark.dickinson. This issue is now closed.

Messages (6)
msg318835 - (view) Author: Francois Schneider (Francois Schneider) Date: 2018-06-06 15:29
>>> import ipaddress
>>> hash(ipaddress.ip_network(u'20.0.2.3/32')) == hash(ipaddress.ip_network(u'20.0.2.0/30'))
True
msg318901 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2018-06-07 06:58
This shouldn't be a problem: there's no rule that says that different objects should have different hashes. Indeed, with a countable infinity of possible different hashable inputs, a deterministic hashing algorithm, and only finitely many outputs, such a rule would be a mathematical impossibility. For example:

>>> hash(-1) == hash(-2)
True

Are these hash collisions causing real issues in your code? While a single hash collision like this shouldn't be an issue, if there are many collisions within a single (non-artificial) dataset, that _can_ lead to performance issues.

Looking at the code, we could probably do a better job of making the hash collisions less predictable. The current code looks like:

    def __hash__(self):
        return hash(int(self.network_address) ^ int(self.netmask))

I'd propose hashing a tuple instead of using the xor. For example:

    def __hash__(self):
        return hash((int(self.network_address), int(self.netmask)))

Hash collisions would almost certainly still occur with this scheme, but they'd be a tiny bit less obvious and harder to find.
msg318904 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-06-07 07:07
Concur with Mark.
msg318905 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2018-06-07 07:09
> I'd propose hashing a tuple instead of using the xor.

To be clear; I'd propose this _only_ if there's evidence that the current hashing scheme is unsatisfactory for real-world use-cases. Otherwise, on an "if it's not broken, don't fix it" basis, I don't think there's any change to be made here.
msg318976 - (view) Author: Francois Schneider (Francois Schneider) Date: 2018-06-07 22:17
Thanks for the analysis, I agree completely.

Actually the problem was coming from my code where one of the __eq__ method was implemented like this:
>>> def __eq__(self, other):
>>>   return hash(self) == hash(other)

so 2 instances with only a slight difference in their ip_network attribute (ip_network(u'20.0.2.3/32') and ip_network(u'20.0.2.0/30')) were having the same hash and being equal -> they could not be inserted both in the same collection.

I will just rewrite my __eq__ method properly.
msg319030 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2018-06-08 07:49
Thanks, Francois. Indeed, comparing hashes isn't a good way to implement equality. :-)  Closing here.
History
Date User Action Args
2018-06-08 07:49:35mark.dickinsonsetstatus: open -> closed

messages: + msg319030
stage: resolved
2018-06-07 22:17:51Francois Schneidersetstatus: pending -> open

messages: + msg318976
2018-06-07 07:20:14mark.dickinsonsetstatus: open -> pending
resolution: not a bug
2018-06-07 07:10:44mark.dickinsonsetnosy: + serhiy.storchaka
2018-06-07 07:09:34mark.dickinsonsetnosy: - serhiy.storchaka
messages: + msg318905
2018-06-07 07:07:43serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg318904
2018-06-07 06:58:40mark.dickinsonsetnosy: + mark.dickinson
messages: + msg318901
2018-06-06 15:32:09Francois Schneidersetversions: + Python 3.4, Python 3.5
2018-06-06 15:29:28Francois Schneidercreate