Message 115335 - Python tracker

Author	terry.reedy
Recipients	LambertDW, barry, eli.bendersky, georg.brandl, ggenellina, gjb1002, hagna, janpf, jimjjewett, mrotondo, pitrou, r.david.murray, rtvd, sjmachin, terry.reedy, tim.peters, vbr
Date	2010-09-01.21:32:58
SpamBayes Score	2.5854874e-06
Marked as misclassified	No
Message-id	<1283376780.15.0.831048812101.issue2986@psf.upfronthosting.co.za>
In-reply-to

Content
While refactoring the code for 2.7, I discovered that the description of the heuristic, both in the 2.6 doc and in the code comments, is off by 1. "items that appear more than 1% of the time" should actually read "items whose duplicates (after the first) appear more than 1% of the time". The discrepancy arises because, in the following code,

        for i, elt in enumerate(b):
            if elt in b2j:
                indices = b2j[elt]
                if n >= 200 and len(indices) * 100 > n:
                    populardict[elt] = 1
                    del indices[:]
                else:
                    indices.append(i)
            else:
                b2j[elt] = [i]

len(indices) is retrieved *before* the index i of the current elt is added, so the count being tested excludes the current occurrence. Whatever one might think the heuristic 'should' have been (and by the nature of heuristics, there is no right answer), the default behavior must remain as it is, so we adjusted the code and doc to match that.
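
To make the off-by-one concrete, here is a minimal sketch that replays the loop above in a standalone function. The names popular_elements and make_sequence are hypothetical helpers introduced only for this illustration, and a set stands in for populardict; this is not the difflib code itself. With n = 200, an item that appears 3 times (1.5% of the input) is not flagged, while one that appears 4 times is.

```python
def popular_elements(b):
    """Replay the quoted loop and return the elements it flags as popular."""
    n = len(b)
    b2j = {}
    popular = set()  # stands in for populardict in the original
    for i, elt in enumerate(b):
        if elt in b2j:
            indices = b2j[elt]
            # len(indices) counts only the earlier occurrences of elt;
            # the current index i has not been appended yet.
            if n >= 200 and len(indices) * 100 > n:
                popular.add(elt)
                del indices[:]
            else:
                indices.append(i)
        else:
            b2j[elt] = [i]
    return popular


def make_sequence(count, n=200):
    """Hypothetical helper: n items, 'x' repeated `count` times, the rest unique."""
    return ["x"] * count + list(range(n - count))


three = popular_elements(make_sequence(3))  # 'x' is 1.5% of the input
four = popular_elements(make_sequence(4))   # 'x' is 2.0% of the input
print(three, four)
```

On the fourth occurrence, len(indices) is 3 and 3 * 100 > 200 holds; on the third it is only 2, and 2 * 100 > 200 does not.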

History
Date	User	Action	Args
2010-09-01 21:33:00	terry.reedy	set	recipients: + terry.reedy, tim.peters, barry, georg.brandl, jimjjewett, sjmachin, gjb1002, ggenellina, pitrou, rtvd, vbr, LambertDW, hagna, r.david.murray, eli.bendersky, janpf, mrotondo
2010-09-01 21:33:00	terry.reedy	set	messageid: <1283376780.15.0.831048812101.issue2986@psf.upfronthosting.co.za>
2010-09-01 21:32:58	terry.reedy	link	issue2986 messages
2010-09-01 21:32:58	terry.reedy	create