
Speed up hash(fractions.Fraction) #82044

Closed
tim-one opened this issue Aug 15, 2019 · 12 comments
Labels
3.9 (only security fixes), performance (Performance or resource usage)

Comments

@tim-one (Member) commented Aug 15, 2019

BPO 37863
Nosy @tim-one, @rhettinger, @mdickinson
PRs
  • bpo-37863: Optimize Fraction.__hash__() #15298

    Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-08-16.04:00:13.295>
    created_at = <Date 2019-08-15.02:30:45.146>
    labels = ['3.9', 'performance']
    title = 'Speed up hash(fractions.Fraction)'
    updated_at = <Date 2019-08-18.22:22:47.434>
    user = 'https://github.com/tim-one'

    bugs.python.org fields:

    activity = <Date 2019-08-18.22:22:47.434>
    actor = 'tim.peters'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-08-16.04:00:13.295>
    closer = 'rhettinger'
    components = []
    creation = <Date 2019-08-15.02:30:45.146>
    creator = 'tim.peters'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 37863
    keywords = ['patch']
    message_count = 12.0
    messages = ['349789', '349790', '349810', '349811', '349814', '349815', '349819', '349836', '349842', '349843', '349916', '349926']
    nosy_count = 3.0
    nosy_names = ['tim.peters', 'rhettinger', 'mark.dickinson']
    pr_nums = ['15298']
    priority = 'low'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue37863'
    versions = ['Python 3.9']

    @tim-one (Member Author) commented Aug 15, 2019

    Recording before I forget. These are easy:

    1. As the comments note, cache the hash code.

    2. Use the new (in 3.8) pow(denominator, -1, modulus) to get the inverse instead of raising to the modulus-2 power. Should be significantly faster. If not, the new "-1" implementation should be changed ;-) Will require catching ValueError in case the denom is a multiple of the modulus.

    3. Instead of multiplying by the absolute value of the numerator, multiply by the hash of the absolute value of the numerator. That changes the multiplication, and the subsequent modulus operation, from unbounded-length operations to short bounded-length ones. Hashing the numerator on its own should be significantly faster, because the int hash doesn't require any multiplies or divides regardless of how large the numerator is.

    None of those should change any computed results.
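
    For concreteness, here is a minimal sketch of what suggestions 2 and 3 look like together (suggestion 1, the cache, is omitted; fraction_hash is a hypothetical stand-in for Fraction.__hash__, with the modulus and infinity constants taken from sys.hash_info):

        import sys

        _MODULUS = sys.hash_info.modulus  # 2**61 - 1 on 64-bit builds
        _INF = sys.hash_info.inf

        def fraction_hash(numerator, denominator):
            # Suggestion 2: ask pow() for the modular inverse directly
            # instead of raising to the modulus-2 power (needs 3.8+).
            try:
                dinv = pow(denominator, -1, _MODULUS)
            except ValueError:
                # The denominator is a multiple of the modulus.
                return _INF if numerator >= 0 else -_INF
            # Suggestion 3: hash the numerator first, so the multiply
            # and the reduction below see bounded-size ints no matter
            # how large the numerator is.
            result = hash(abs(numerator)) * dinv % _MODULUS
            if numerator < 0:
                result = -result
            return -2 if result == -1 else result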

    @tim-one added the 3.9 (only security fixes) and performance labels on Aug 15, 2019
    @pppery (mannequin) changed the title from "Speed hash(fractions.Fraction)" to "Speed up hash(fractions.Fraction)" on Aug 15, 2019
    @rhettinger (Contributor)

    I made a quick PR for you. I skipped #1 because I don't think Fraction hashing is worth adding another slot.

    @mdickinson (Member)

    Should be significantly faster. If not, the new "-1" implementation should be changed ;-)

    I wouldn't have bet on this, before seeing Raymond's benchmark results. Writing a fast path for invmod for C-size integers is still on my to-do list; the current implementation does way too many Python-level divisions.

    @tim-one (Member Author) commented Aug 15, 2019

    Why I expected a major speedup from this: the binary exponentiation routine (for "reasonably small" exponents) does 30 * ceiling(exponent.bit_length() / 30) multiply-and-reduces, plus another for each bit set in the exponent. That's a major difference between exponents of bit lengths 61 ((P-2).bit_length()) and 1 ((1).bit_length()). Indeed, I bet it would pay in long_pow() to add another test, under the if (Py_SIZE(b) < 0) branch, to skip the exponentiation part entirely when b is -1. long_invmod() would be the end of it then, because I expect using an exponent of -1 for modular inverse to be overwhelmingly more common than using any other negative exponent with a modulus.
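
    A back-of-the-envelope count using that formula (a model of the work, not the exact long_pow() accounting):

        import math

        P = 2**61 - 1

        def mul_and_reduce_count(exponent):
            # 30 squarings per 30-bit digit of the exponent, plus one
            # multiply for each set bit (Tim's formula above).
            squarings = 30 * math.ceil(exponent.bit_length() / 30)
            multiplies = bin(exponent).count("1")
            return squarings + multiplies

        print(mul_and_reduce_count(P - 2))  # 150 for pow(x, P-2, P)
        print(mul_and_reduce_count(1))      # 31 for an exponent of 1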

    @mdickinson (Member)

    That's a major difference between exponents of bit lengths 61 ((P-2).bit_length()) and 1 ((1).bit_length()).

    Right, but that's stacked up against the cost of the extended Euclidean algorithm for computing the inverse. The extended gcd for computing the inverse of 1425089352415399815 (for example) modulo 2**61 - 1 takes 69 steps, each one of which involves a PyLong quotient-and-remainder division, a PyLong multiplication and a subtraction. So that's at least the same order of magnitude when it comes to number of operations.
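
    To make that concrete, here is a textbook extended-Euclid step counter (just the plain loop, not CPython's long_invmod()):

        def egcd_steps(a, m):
            # Extended Euclid: returns gcd(a, m), a Bezout coefficient x
            # with a*x == gcd (mod m), and the number of division steps.
            # Each step does a quotient-and-remainder division, one
            # multiplication (q * x), and a subtraction, as Mark says.
            old_r, r = a, m
            old_x, x = 1, 0
            steps = 0
            while r:
                q, rem = divmod(old_r, r)
                old_r, r = r, rem
                old_x, x = x, old_x - q * x
                steps += 1
            return old_r, old_x, steps

        g, x, n = egcd_steps(1425089352415399815, 2**61 - 1)
        print(n)  # the step count Mark cites (~69 for this pair)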

    I'd bet that a dedicated pure C square-and-multiply algorithm (with an addition chain specifically chosen for the target modulus, and with the multiplication and reduction specialised for the particular form of the modulus) would still be the fastest way to go here. I believe optimal addition chains for 2**31-3 are known, and it shouldn't be too hard to find something close-to-optimal (as opposed to proved optimal) for 2**61-3.

    @mdickinson (Member)

    Indeed, I bet it would pay in long_pow() to add another test, under the if (Py_SIZE(b) < 0) branch, to skip the exponentiation part entirely when b is -1.

    Agreed.

    @tim-one (Member Author) commented Aug 15, 2019

    Well, details matter ;-) Division in Python is expensive. In the exponentiation algorithm each reduction (in general) requires a 122-by-61 bit division. In egcd, after it gets going nothing exceeds 61 bits, and across iterations the inputs to the division step get smaller each time around.

    So, e.g., when Raymond tried a Fraction with denominator 5813, "almost all" the egcd divisions involved inputs each with a single internal Python "digit". But "almost all" the raise-to-the-(P-2) divisions involve a numerator with 5 internal digits and 3 in the denominator. Big difference, even if the total number of divisions is about the same.
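
    The internal "digit" counts are easy to inspect (30-bit digits on a typical 64-bit CPython build):

        import sys

        BITS = sys.int_info.bits_per_digit  # 30 on typical 64-bit builds

        def ndigits(n):
            # Number of internal base-2**BITS digits CPython stores for n.
            return max(1, -(-n.bit_length() // BITS))

        P = 2**61 - 1
        print(ndigits(P))           # 3 digits for the 61-bit modulus
        print(ndigits((P - 1)**2))  # 5 digits for a ~122-bit product
        print(ndigits(5813))        # 1 digit, like egcd's later operands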

    @tim-one (Member Author) commented Aug 15, 2019

    Mark, I did just a little browsing on this. It seems to be well known that egcd beats straightforward exponentiation for this purpose in arbitrary-precision contexts, for reasons already sketched (egcd needs narrower arithmetic from the start, benefits from the division narrowing on each iteration, and since the quotient on each iteration usually fits in a single "digit", the multiplication by the quotient goes fast too).

    But gonzo implementations switch back to exponentiation, using fancier primitives like Montgomery multiplication.

    As usual, I'm not keen on bloating the code for "state of the art" giant-int algorithms, but suit yourself! The focus of this PR is dead-simple spelling changes with relatively massive payoffs.

    @rhettinger (Contributor)

    New changeset f3cb68f by Raymond Hettinger in branch 'master':
    bpo-37863: Optimize Fraction.__hash__() (GH-15298)

    @rhettinger (Contributor)

    There's likely more that could be done -- I've just taken the low-hanging fruit. If someone wants to re-open this and go further, please take it from here.

    @tim-one (Member Author) commented Aug 18, 2019

    Some random notes:

    • 1425089352415399815 appears to be derived from using the golden ratio to contrive a worst case for the Euclid egcd method. Which it's good at :-) Even so, the current code runs well over twice as fast as when replacing the pow(that, -1, P) with pow(that, P-2, P).

    • Using binary gcd on the same thing requires only 46 iterations - and, of course, no divisions at all. So that could be a big win. There's no possible way to get exponentiation to require fewer than 60 iterations, because it requires that many squarings just to reach the high bit of P-2. However, I never finished working out how to extend binary gcd to return inverses too (one standard way is sketched after these notes).

    • All cases are "bad" for pow(whatever, P-2, P) because P-2 has 60 bits set. So we currently need 60 multiplies to account for those, in addition to 60 squarings to reach P-2's high bit. A significant speedup could probably have been gotten just by rewriting whatever**(P-2) as

      (whatever ** 79511827903920481) ** 29

    That is, express P-2 via its prime factorization. There are 28 zero bits in the larger factor, so that would save 28 multiply steps right there. I don't know how far that may yet be from an optimal addition chain for P-2.

    • The worst burden of the P-2-power method is that there's no convenient way to exploit that % P _can_ be very cheap, because P has a very special value. The power method actually needs to divide by P on each step. As currently coded, the egcd method never needs to divide by P (although _in_ the egcd part it divides narrower & narrower numbers < P).

    • Something on my todo list forever was writing an internal routine for squaring, based on the identity (a+b)**2 = a**2 + 2ab + b**2. That gives a Karatsuba-like O() speedup but with simpler code - enough simpler that it could probably be profitably applied even to a relatively small argument.
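
      One way to read that idea (a hedged sketch, not Tim's unwritten routine): recover 2ab as (a+b)**2 - a**2 - b**2, so one n-bit squaring becomes three roughly half-width squarings, which is where the Karatsuba-like exponent comes from:

          def sqr(x, cutoff=64):
              # Recursive squaring for x >= 0. Split x = a*2**k + b, so
              #   x**2 = a**2 * 2**(2k) + 2ab * 2**k + b**2,
              # with 2ab = (a+b)**2 - a**2 - b**2: three half-width
              # squarings per level and no general multiplications.
              n = x.bit_length()
              if n <= cutoff:
                  return x * x
              k = n // 2
              a, b = x >> k, x & ((1 << k) - 1)
              a2, b2, s2 = sqr(a), sqr(b), sqr(a + b)
              return (a2 << 2 * k) + ((s2 - a2 - b2) << k) + b2

          assert sqr(12345678901234567890123456789) == 12345678901234567890123456789**2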

    Which of those do I intend to pursue? Probably none :-( But I confess to being _annoyed_ at giving up on extending binary gcd to compute an inverse, so I may at least do that in Python before I die ;-)
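
    For the record, the textbook "binary extended gcd" is one standard way to get inverses out of binary gcd when the modulus is odd, as P = 2**61 - 1 is. A sketch under that assumption (not the unwritten routine above):

        def invmod_binary(a, m):
            # Inverse of a mod odd m; requires gcd(a, m) == 1.
            # Invariants: x1*a == u (mod m) and x2*a == v (mod m).
            # Only shifts, halvings, adds, and subtracts are used.
            u, v = a % m, m
            x1, x2 = 1, 0
            while u != 1 and v != 1:
                while u % 2 == 0:
                    u //= 2
                    # x1/2 directly if even, else (x1 + m)/2 (m is odd).
                    x1 = x1 // 2 if x1 % 2 == 0 else (x1 + m) // 2
                while v % 2 == 0:
                    v //= 2
                    x2 = x2 // 2 if x2 % 2 == 0 else (x2 + m) // 2
                if u >= v:
                    u, x1 = u - v, x1 - x2
                else:
                    v, x2 = v - u, x2 - x1
            return x1 % m if u == 1 else x2 % m

        P = 2**61 - 1
        a = 1425089352415399815
        assert invmod_binary(a, P) == pow(a, -1, P)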

    @tim-one (Member Author) commented Aug 18, 2019

    For posterity:

    "Modular Inverse Algorithms Without Multiplications for Cryptographic Applications"
    Laszlo Hars
    https://link.springer.com/article/10.1155/ES/2006/32192

    """
    On the considered computational platforms for operand lengths used in cryptography, the fastest presented modular inverse algorithms need about twice the time of modular multiplications, or even less.
    """

    Hars considered a great many algorithm variations and wrote up about ten of the best. Several things surprised me here. Most surprising: under some realistic assumptions, the best turned out to be a relatively simple variation of Euclid's extended GCD. Nutshell: during a step, let the difference between the bit lengths of the then-current numerator and denominator be f. Then look at a few leading bits to pick whichever s in {f-1, f, f+1} will clear the largest handful of leading bits in numerator - (denominator << s) (this test is meant to be fast, not exact - it's a refinement of an easier variant that just picks s=f). The "quotient" in this step is then 2**s, and the rest is shifting and adding/subtracting. No division or multiplication is needed.

    This has a larger expected iteration count than some other methods, but, like regular old Euclid GCD, needs relatively little code per iteration.
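
    A toy sketch of the flavor, gcd only (it always tries s = f and backs off by one on overshoot; Hars's version also probes s in {f-1, f, f+1} with a leading-bits test and carries the extended coefficients to produce the inverse):

        def shift_gcd(a, b):
            # Euclid with power-of-two "quotients": each step subtracts
            # b << s from a, so only shifts and subtractions are needed.
            while b:
                if a < b:
                    a, b = b, a
                    continue
                s = a.bit_length() - b.bit_length()
                if (b << s) > a:
                    s -= 1  # back off when 2**s * b overshoots a
                a -= b << s
            return a

        assert shift_gcd(1425089352415399815, 2**61 - 1) == 1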

    @ezio-melotti transferred this issue from another repository on Apr 10, 2022