
Author tim.peters
Recipients mark.dickinson, rhettinger, tim.peters
Date 2020-07-30.20:27:49
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1596140869.89.0.17576207927.issue41131@roundup.psfhosted.org>
In-reply-to
Content
I'm not convinced the alias method is valuable here. Because of the preprocessing expense, it cries out for a class instead: an object that can retain the preprocessed tables, be saved to disk (pickled), restored later, and used repeatedly to make O(1) selections. I have such a class, which uses exact integer arithmetic (e.g., during preprocessing there's at least one "too small" bin if and only if there's at least one "too large" bin), and is as unbiased as choice() and randrange(). Reasoning about float vagaries is much harder, and I see no error analysis here.
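[For concreteness, a minimal sketch of what such a class might look like — this is not the class from the message (which isn't shown), just an illustration of the exact-integer idea: scale each weight by n so every bin's capacity is exactly the integer total, and no division (hence no rounding) occurs during preprocessing.]

```python
import random

class AliasSampler:
    """Sketch of an alias-method sampler with exact integer preprocessing.
    Because everything is integer arithmetic, a "too small" bin is
    guaranteed a "too large" partner, and each draw is O(1) and as
    unbiased as randrange()."""

    def __init__(self, weights):          # weights: positive integers
        n = len(weights)
        total = sum(weights)
        # Scale every weight by n: each bin's exact capacity is `total`.
        scaled = [w * n for w in weights]
        cutoff = [0] * n   # bin i keeps its own index with prob cutoff[i]/total
        alias = [0] * n    # otherwise it yields alias[i]
        small = [i for i in range(n) if scaled[i] < total]
        large = [i for i in range(n) if scaled[i] >= total]
        while small:
            i = small.pop()
            j = large[-1]                 # nonempty: exact-sum invariant
            cutoff[i], alias[i] = scaled[i], j
            scaled[j] -= total - scaled[i]
            if scaled[j] < total:
                small.append(large.pop())
        for j in large:                   # leftovers hold exactly `total`
            cutoff[j], alias[j] = total, j
        self.n, self.total = n, total
        self.cutoff, self.alias = cutoff, alias

    def choice(self, rng=random):
        i = rng.randrange(self.n)
        return i if rng.randrange(self.total) < self.cutoff[i] else self.alias[i]
```

Since the tables are plain lists of ints, an instance pickles cleanly and can be reused for any number of O(1) draws.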

Most troubling: due to rounding, reducing a "too large" bin can result in a weight that spuriously compares <= bin_size. Then a "too small" bin with a genuinely tiny weight can be left with nothing to pair with, and so ends up aliasing itself, giving it a massively too-high probability of being picked.

So if you're determined to stick with "a function", I'd use the simple inverse CDF sampling method. Unless the number of choices is "very" large, the log factor doesn't much matter.
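[A hedged sketch of that approach, similar in spirit to what random.choices does with cum_weights, but drawing an integer with randrange so it stays exact for integer weights: O(n) setup, O(log n) per draw via bisection.]

```python
import bisect
import random
from itertools import accumulate

def inverse_cdf_sampler(weights, rng=random):
    """Return a closure that draws a weighted index via inverse CDF.
    Exact (no float rounding) when the weights are integers."""
    cum = list(accumulate(weights))   # running totals: the unnormalized CDF
    total = cum[-1]
    def choice():
        # First bin whose cumulative weight exceeds a uniform integer draw.
        return bisect.bisect_right(cum, rng.randrange(total))
    return choice
```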

That said, the alias numerics here could be improved some by working with the weights directly, "normalizing" them only at the end.  Instead of comparing weights to bin_size, compare them instead to the mean:

mean = total / n

(note: the next step to getting rid of floats entirely is to pre-multiply the weights by lcm(total, n) // total  - then their mean is exactly the integer lcm(total, n) // n).
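[Spelled out with a small illustrative example — the weights are made up, and math.lcm is Python 3.9+. After scaling, the mean of the weights is an exact integer, so the small/large comparison involves no rounding at all:]

```python
from math import lcm

weights = [3, 1, 6]                    # illustrative integer weights
n = len(weights)
total = sum(weights)                   # 10
L = lcm(total, n)                      # 30
scaled = [w * (L // total) for w in weights]   # [9, 3, 18]
mean = L // n                          # exactly 10: an integer, no float rounding
assert sum(scaled) == L and sum(scaled) == mean * n
```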

The mean (instead of bin_size) suffers rounding errors then, but the original weights don't during the "hard part" of preprocessing.  At the end,

fn = n + 0.0
U = [weights[i] / total + i / fn for i in range(n)]

instead.  The "i / fn" also suffers just one rounding error as opposed to the two in the current "bin_width * i".  That can make a difference for n as small as 5:

>>> 3/5
0.6
>>> (1/5)*3
0.6000000000000001