
Author exhuma
Recipients exhuma, ncoghlan, pmoody
Date 2014-03-23.12:12:12
Content
Sorry for the late reply. I wanted to take some time and give a more detailed explanation. At least to the best of my abilities :)

I attached a zip-file with my quick-and-dirty test-rig. The zip contains:

 * gendata.py -- The script I used to generate test-data
 * testdata.lst -- My test-data set (for reproducibility)
 * tester.py -- A simple script using ``timeit.timeit``.

I am not sure how sensitive the data I am working with is, so I prefer not to put any of the real data on a public forum. Instead, I wrote a small script (``gendata.py``) which generates a data-set that makes the performance difference visible. The real data I processed actually produced an even worse case, but that is difficult to reproduce. The generated data-set I used is in the file named ``testdata.lst``.
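To give an idea of the shape of that data, a generator along the following lines would do. This is only a hypothetical stand-in for ``gendata.py`` (the attached script is the authoritative version); it just produces a long run of adjacent /32 networks surrounded by filler networks, which is the pattern discussed below:

    # Hypothetical stand-in for gendata.py, NOT the attached script:
    # a long run of adjacent /32s (which collapse step by step into one
    # block), surrounded by random filler networks.
    import ipaddress
    import random

    def generate(path="testdata.lst", fillers=5000, run_length=2048):
        nets = []
        for _ in range(fillers):
            addr = ipaddress.ip_address(random.getrandbits(32))
            nets.append(ipaddress.ip_network(str(addr) + "/32"))
        base = int(ipaddress.ip_address("10.0.0.0"))
        for offset in range(run_length):
            addr = ipaddress.ip_address(base + offset)
            nets.append(ipaddress.ip_network(str(addr) + "/32"))
        with open(path, "w") as handle:
            handle.write("\n".join(str(net) for net in nets))

    if __name__ == "__main__":
        generate()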

I then ran the operation 5 times using ``timeit`` (tester.py).
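In essence that boils down to something like the following (a sketch from memory, not the attached file verbatim):

    # Minimal equivalent of tester.py: load the generated networks and
    # time ipaddress.collapse_addresses over them with timeit.
    import ipaddress
    import timeit

    with open("testdata.lst") as handle:
        nets = [ipaddress.ip_network(line.strip())
                for line in handle if line.strip()]

    elapsed = timeit.timeit(
        lambda: list(ipaddress.collapse_addresses(nets)), number=5)
    print("Execution time:", elapsed, "seconds")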

Let me also outline an explanation of what is happening:

It is possible that one "merge" operation makes a second one possible. For the sake of readability, let's use IPv4 addresses and consider the following list:

    [a_1, a_2, ..., a_n, 192.168.1.0/31, 192.168.1.2/32, 192.168.1.3/32, b_1, b_2, ..., b_n]

This can be reduced to

    [a_1, a_2, ..., a_n, 192.168.1.0/31, 192.168.1.2/31, b_1, b_2, ..., b_n]
	
Which in turn can then be reduced to:

    [a_1, a_2, ..., a_n, 192.168.1.0/30, b_1, b_2, ..., b_n]
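
In concrete terms, leaving out the a/b filler, those three networks really do end up as a single /30:

    >>> import ipaddress
    >>> nets = [ipaddress.ip_network('192.168.1.0/31'),
    ...         ipaddress.ip_network('192.168.1.2/32'),
    ...         ipaddress.ip_network('192.168.1.3/32')]
    >>> list(ipaddress.collapse_addresses(nets))
    [IPv4Network('192.168.1.0/30')]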
	
The current implementation sets a boolean (``optimized``) to ``True`` if any merge has been performed during a pass. If so, it re-runs over the whole list, and repeats this until a pass performs no merge at all (sketched below). Those re-runs also cover [a1..an] and [b1..bn], which is unnecessary. With the given data-set, this gives the following result:

    Execution time: 48.27790632040014 seconds
    ./python tester.py  244.29s user 0.06s system 99% cpu 4:04.51 total
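
Stripped of the ipaddress specifics, that strategy looks roughly like this. It is a simplified sketch, not the actual stdlib code, and ``try_merge`` is a hypothetical helper that returns the merged network if its two arguments can be joined into one supernet, or ``None`` otherwise:

    # Sketch of the current "re-scan until nothing changes" strategy.
    def collapse_rescan(nets, try_merge):
        nets = sorted(nets)
        optimized = True
        while optimized:        # repeat full passes until one merges nothing
            optimized = False
            result = []
            for net in nets:
                merged = try_merge(result[-1], net) if result else None
                if merged is not None:
                    result[-1] = merged   # merge with the previous entry
                    optimized = True
                else:
                    result.append(net)
            nets = result       # every pass walks the complete list again
        return nets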

With the shift/reduce approach (also sketched below), only as many nodes are visited as necessary. When a "reduce" is made, it "backtracks" as far as further merges are possible, but no further. So in the above example, the nodes [a1..an] are visited only once, and [b1..bn] are not touched until the example addresses have been completely collapsed, and then also only once. With the given data-set, this gives the following result:

    Execution time: 20.298685277199912 seconds
    ./python tester.py  104.20s user 0.14s system 99% cpu 1:44.58 total
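
For comparison, the shift/reduce variant looks roughly like this (same hypothetical ``try_merge`` helper; a sketch of the idea, not the patch itself):

    # Sketch of the shift/reduce approach: keep a stack of already-reduced
    # networks; after a merge, keep reducing against the new top of the
    # stack ("backtracking") instead of restarting from the beginning.
    def collapse_shift_reduce(nets, try_merge):
        stack = []
        for net in sorted(nets):      # shift the next network onto the stack
            while stack:
                merged = try_merge(stack[-1], net)
                if merged is None:
                    break
                stack.pop()           # reduce: this merge may enable another
                net = merged          # merge with the new top of the stack
            stack.append(net)
        return stack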
	
If my thoughts are correct, both implementations should have a similar "best case", but the "worst case" differs significantly. I am not well-versed in Big-O notation, especially the best/average/worst-case distinctions, and I am not good at math either. But I believe both are, strictly speaking, O(n), or more precisely O(k*n), where k is proportional to the number of reconciliation steps needed (that is, the cases where one merge makes another merge possible). The difference is that k is much smaller with the shift/reduce approach, because only as many elements are revisited as necessary, instead of all of them.