
Author rhettinger
Recipients pitrou, rhettinger, serhiy.storchaka
Date 2015-01-25.20:31:31
Message-id <1422217891.56.0.664510306819.issue23269@psf.upfronthosting.co.za>
Content
Reopening this with a better patch that nicely tightens the inner loop without an algorithmic change or any downside.

Currently, the entry computation adds a constant (i + j), masks it (& mask) to wrap on overflow, scales it to a table offset (<< 4), and then does the key lookup (one memory access). These steps are sequential (not parallelizable): each one depends on the result of the previous one.
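A minimal sketch of the current computation, using a hypothetical simplified entry type (the real code is in Objects/setobject.c):

```c
#include <stddef.h>

/* Hypothetical simplified entry type. In the real table each entry is
   16 bytes, which is why scaling the index to a byte offset is a
   shift by 4 (done implicitly here by array indexing). */
typedef struct { void *key; long hash; } setentry;

/* Each step feeds the next: add (i + j), mask to wrap on overflow,
   scale to a byte offset, then load. The data dependencies serialize
   the chain. */
static setentry *
current_probe(setentry *table, size_t mask, size_t i, size_t j)
{
    return &table[(i + j) & mask];
}
```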

The patch moves the wrap-on-overflow decision out of the inner loop, adding a single, fast, predictable branch (i + LINEAR_PROBES > mask). The inner loop then simplifies to an add, a fetch, and a test (no masking or shifting).
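A sketch of the patched shape, again with hypothetical simplified types and a hypothetical helper name (find_free_slot); the wrap check is hoisted so the common fast path probes consecutive entries with a bare pointer increment:

```c
#include <stddef.h>

#define LINEAR_PROBES 9

/* Hypothetical simplified entry type standing in for the real one in
   Objects/setobject.c. */
typedef struct { void *key; long hash; } setentry;

/* Return the first free slot in the probe window starting at i,
   or NULL if all LINEAR_PROBES + 1 slots are occupied. */
static setentry *
find_free_slot(setentry *table, size_t mask, size_t i)
{
    setentry *entry;
    size_t j;

    if (i + LINEAR_PROBES <= mask) {
        /* Fast path: the window cannot wrap, so the loop body is just
           an add (entry++), a fetch, and a test -- no masking. */
        entry = &table[i];
        for (j = 0; j <= LINEAR_PROBES; j++, entry++) {
            if (entry->key == NULL)
                return entry;
        }
    }
    else {
        /* Slow path, taken rarely: mask each index to wrap on overflow. */
        for (j = 0; j <= LINEAR_PROBES; j++) {
            entry = &table[(i + j) & mask];
            if (entry->key == NULL)
                return entry;
        }
    }
    return NULL;
}
```

The single range check before the loop is highly predictable (almost always true on large tables), so the branch predictor hides its cost.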

The generated code on Clang is very tight:

LBB29_26:      
    cmpq    $0, (%rsi)       # entry->key == NULL
    je  LBB29_30
    incq    %rdx             # j++
    addq    $16, %rsi        # entry++  (done in parallel with the incq)
    cmpq    $9, %rdx         # j <= LINEAR_PROBES
    jbe LBB29_26

On gcc-4.9, the loop is unrolled and we get even better code:

    leaq    (%rbx,%rsi), %rdx   # entry = table[i]
    cmpq    $0, (%rdx)          # entry.key == NULL
    je  L223

    leaq    16(%rbx,%rsi), %rdx # entry = table[i+1];
    cmpq    $0, (%rdx)          # entry.key == NULL
    je  L223

    leaq    32(%rbx,%rsi), %rdx
    cmpq    $0, (%rdx)
    je  L223