
Author rhettinger
Recipients mark.dickinson, rhettinger, tim.peters
Date 2020-06-26.20:38:40
Message-id <1593203921.14.0.316410898298.issue41131@roundup.psfhosted.org>
In-reply-to
Content
For n unequal weights and k selections, sample selection with the inverse-cdf method is O(k log₂ n).  Using the alias method, it improves to O(k).  The proportionality constants also favor the alias method, so if the setup times were the same, the alias method would always win (even when n=2).

However, the setup times are not the same.  For the inverse-cdf method, setup is O(1) if cum_weights are given; otherwise, it is O(n) with a fast loop.  The setup for the alias method is also O(n) but is proportionally much slower.
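For reference, a minimal sketch of the inverse-cdf approach (essentially what random.choices() does today): accumulate() gives the O(n) setup and bisect() gives the O(log₂ n) per-draw selection.

```python
from bisect import bisect
from itertools import accumulate
from random import random

def inverse_cdf_choices(population, weights, k):
    """Draw k elements with replacement using the inverse-cdf method.

    Setup is one O(n) pass to build the cumulative weights; each of
    the k selections is then a binary search, O(log2 n).
    """
    cum_weights = list(accumulate(weights))     # O(n) "fast loop" setup
    total = cum_weights[-1]
    hi = len(cum_weights) - 1                   # clamp index against rounding
    return [population[bisect(cum_weights, random() * total, 0, hi)]
            for _ in range(k)]
```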

So, there would need to be a method selection heuristic based on the best trade-off between setup time and sample selection time.
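One possible shape for such a heuristic, as a toy cost model only (the setup_ratio constant here is made up and would have to come from actual benchmarks):

```python
def pick_method(n, k, cum_weights_given=False, setup_ratio=8):
    """Toy cost model for choosing a selection method.

    Assumes alias setup costs ~setup_ratio * n, inverse-cdf setup
    costs ~n (or ~0 when cum_weights are supplied), and per-draw
    costs are O(1) versus O(log2 n).  setup_ratio is a hypothetical
    stand-in for measured timings.
    """
    log2n = max((n - 1).bit_length(), 1)   # ~ceil(log2 n) comparisons per draw
    inverse_cost = (0 if cum_weights_given else n) + k * log2n
    alias_cost = setup_ratio * n + k
    return 'alias' if alias_cost < inverse_cost else 'inverse-cdf'
```

With few draws the alias setup never pays for itself; with many draws relative to n, it dominates.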

Both methods make k calls to random().

See: https://en.wikipedia.org/wiki/Alias_method

Notes on the attached draft implementation:

* Need to add back the error-checking code.

* Need a better method selection heuristic.

* The alias table K defaults to the original index
  so that there is always a valid selection even
  if there are small rounding errors.

* The condition for the aliasing loop is designed
  to have an early-out when the remaining blocks
  all have equal weights.  Also, the loop condition
  makes sure that the pops never fail even if there
  are small rounding errors when partitioning
  oversized bins or if the sum of weights isn't
  exactly 1.0.
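The attached draft isn't reproduced here, but the notes above can be illustrated with a rough sketch of the alias method (the Vose-style construction).  Initializing K to the identity gives the rounding-error safety net described above; this sketch uses the textbook loop condition, not necessarily the exact early-out condition in the draft.

```python
from random import random

def build_alias_tables(weights):
    """Build the probability table U and alias table K in O(n).

    K starts as the identity so that any small rounding slop still
    leaves a valid index to select.
    """
    n = len(weights)
    total = sum(weights)
    U = [w * n / total for w in weights]    # scaled so the mean is 1.0
    K = list(range(n))                      # default: each bin aliases itself
    small = [i for i, u in enumerate(U) if u < 1.0]
    large = [i for i, u in enumerate(U) if u > 1.0]
    while small and large:                  # ends early once bins are equal
        s = small.pop()
        l = large.pop()
        K[s] = l                            # overflow from bin l fills bin s
        U[l] -= 1.0 - U[s]                  # remove the donated probability
        (small if U[l] < 1.0 else large).append(l)
    return U, K

def alias_choice(U, K):
    """O(1) selection: pick a uniform bin, then keep it or take its alias."""
    i = int(random() * len(U))
    return i if random() < U[i] else K[i]
```

For example, with weights [1, 1, 2] the construction leaves bins 0 and 1 aliasing bin 2, so bin 2 is selected with probability 1/3 + 2 * (0.25/3) = 0.5, matching its weight.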