Author terry.reedy
Recipients LambertDW, barry, eli.bendersky, georg.brandl, ggenellina, gjb1002, hagna, janpf, jimjjewett, mrotondo, pitrou, r.david.murray, rtvd, sjmachin, terry.reedy, tim.peters, vbr
Date 2010-09-01.21:32:58
Message-id <1283376780.15.0.831048812101.issue2986@psf.upfronthosting.co.za>
In-reply-to
Content
While refactoring the code for 2.7, I discovered that the description of the heuristic, both in the 2.6 documentation and in the code comments, is off by one. "items that appear more than 1% of the time" should actually be "items whose duplicates (after the first) appear more than 1% of the time". The discrepancy arises because in the following code

        for i, elt in enumerate(b):
            if elt in b2j:
                indices = b2j[elt]
                if n >= 200 and len(indices) * 100 > n:
                    populardict[elt] = 1
                    del indices[:]
                else:
                    indices.append(i)
            else:
                b2j[elt] = [i]

len(indices) is retrieved *before* the index i of the current elt is added. Whatever one might think the heuristic 'should' have been (and by the nature of heuristics, there is no right answer), the default behavior must remain as it is, so we adjusted the code and doc to match that.
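
To make the off-by-one concrete, here is a minimal sketch that wraps the loop quoted above in a standalone function. The name `popular_elements` and the test sequences are hypothetical, chosen only for illustration; the loop body mirrors the quoted snippet, with a set standing in for `populardict`. With n = 200 the 1% threshold is 2 occurrences, yet an element occurring 3 times (1.5%) is not flagged, because only the occurrences seen before the current one are counted.

```python
def popular_elements(b):
    """Return the elements the quoted loop would mark as popular.

    Hypothetical helper for illustration; it mirrors the loop above.
    """
    n = len(b)
    b2j = {}
    popular = set()
    for i, elt in enumerate(b):
        if elt in b2j:
            indices = b2j[elt]
            # len(indices) counts only occurrences seen *before* index i.
            if n >= 200 and len(indices) * 100 > n:
                popular.add(elt)
                del indices[:]
            else:
                indices.append(i)
        else:
            b2j[elt] = [i]
    return popular


# Two sequences of length 200: "x" occurs 3 times in one, 4 times in the other.
three = ["x"] * 3 + list(range(197))
four = ["x"] * 4 + list(range(196))

# Literal reading of "appears more than 1% of the time": 3 occurrences qualify.
literal_three = three.count("x") * 100 > len(three)

print(literal_three)            # True
print(popular_elements(three))  # set()
print(popular_elements(four))   # {'x'}
```

Under the literal reading, 3 occurrences out of 200 would already qualify; the loop only flags the element on its fourth occurrence, which matches the "duplicates after the first" wording.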