Author cmn
Recipients cmn, pitrou, pmoody, python-dev, serhiy.storchaka
Date 2015-01-20.23:48:27
Message-id <1421797707.43.0.446694550251.issue23266@psf.upfronthosting.co.za>
Content
Eliminating duplicates before processing only pays off when the overhead of the set operation is less than the extra time required to sort the larger dataset with duplicates.

So we are basically comparing sort(data) to sort(set(data)).
The optimum depends on the input data.
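A rough way to see this comparison in isolation is sketched below. It is only an illustration, not the benchmark from this message: it uses the standard-library ipaddress module rather than the patched bipaddress copy, a smaller input, and a hypothetical helper named time_sorts. Timings will vary by machine, so none are shown.

```python
import ipaddress
import random
import timeit


def time_sorts(data, repeat=3):
    """Return the best time in seconds for sorted(data) and sorted(set(data))."""
    # time_sorts is a hypothetical helper for this sketch, not part of ipaddress.
    plain = min(timeit.repeat(lambda: sorted(data), number=1, repeat=repeat))
    dedup = min(timeit.repeat(lambda: sorted(set(data)), number=1, repeat=repeat))
    return plain, dedup


base = ipaddress.ip_address('2001:db8::')
presorted = [base + i for i in range(20000)]  # already in ascending order
shuffled = presorted[:]
random.shuffle(shuffled)

for name, data in (("shuffled", shuffled), ("presorted", presorted)):
    plain, dedup = time_sorts(data)
    print(f"{name}: sorted(data) {plain:.4f}s, sorted(set(data)) {dedup:.4f}s")
```

A set iterates in hash order, so building one generally discards any ordering the input already had, and the following sort has to start from scratch.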

python3 -m timeit -s "import random; import bipaddress; ips = [bipaddress.ip_address('2001:db8::') + i for i in range(100000)]; random.shuffle(ips)" -- "bipaddress.collapse_addresses(ips)"

10 loops, best of 3: 1.49 sec per loop
vs.
10 loops, best of 3: 1.59 sec per loop

If the data is pre-sorted, which is possible if you retrieve it from a database, things are drastically different:

python3 -m timeit -s "import random; import bipaddress; ips = [bipaddress.ip_address('2001:db8::') + i for i in range(100000)]; " -- "bipaddress.collapse_addresses(ips)"
10 loops, best of 3: 136 msec per loop
vs
10 loops, best of 3: 1.57 sec per loop

So for my use case, where I have less than 0.1% duplicates (if any), dropping the set would be better, but other use cases will exist.

Still, it is easy to "emulate" the use of "sorted(set())" from a user's perspective: just call collapse_addresses(set(data)) if you expect duplicates and see a speedup from passing in unique, possibly even sorted, data.
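A minimal sketch of that caller-side deduplication, assuming the standard-library ipaddress module as a stand-in for the patched bipaddress copy (the sample addresses are made up for illustration):

```python
import ipaddress

base = ipaddress.ip_address('2001:db8::')
data = [base + i for i in range(8)]
data += data[:3]  # add a few duplicates

# Pass the input as-is, duplicates included.
collapsed_raw = list(ipaddress.collapse_addresses(data))

# Deduplicate on the caller's side before collapsing.
collapsed_dedup = list(ipaddress.collapse_addresses(set(data)))

print(collapsed_raw == collapsed_dedup)
print(collapsed_dedup)
```

Both calls yield the same networks; the only difference is who pays for the set() and whether the caller gets to skip it.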

On the other hand, if you have a huge load of 99.99% sorted, non-collapsible addresses, a user cannot drop the set() operation from the function's sorted(set()); there is no way to speed things up, and the slowdown is about 10x.

That said, I'd drop the set().
Optimization depends on the input data; dropping the set() allows users to optimize based on the nature of their input data.