
Author rhettinger
Recipients rhettinger, serhiy.storchaka
Date 2017-02-08.03:24:31
Message-id <1486524274.18.0.162976124513.issue29476@psf.upfronthosting.co.za>
Content
Dummy entries created by deletions are currently handled in two places.  The code in set_add_entry() tracks a freeslot to allow dummy slots to be reused when new elements are inserted. The dummies are also eliminated during resizing.

The freeslot checks are made in the inner-loop of set_add_entry() and will only eliminate a few of the dummies (ones that happen to be in a slot that we want to use again).  In contrast, the resize routine eliminates 100% of the dummy entries -- this is where most dummies get reclaimed.
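The freeslot mechanism can be illustrated with a toy Python model (a hypothetical sketch with simplified names, not the real C code in Objects/setobject.c):

```python
# Toy model of an open-addressed set table (hypothetical sketch, not the
# real C code).  None marks an empty slot; DUMMY marks the hole left
# behind by a deletion.  Small ints hash to themselves, which keeps the
# probe sequences easy to follow.
DUMMY = "<dummy>"
SIZE = 8                                    # power of two; mask = SIZE - 1

def add(table, key):
    """Insert with freeslot reuse, mirroring the current set_add_entry():
    remember the first dummy seen on the probe path and, if the key turns
    out to be absent, store the new key there instead of in a fresh slot."""
    i = hash(key) & (SIZE - 1)
    freeslot = None
    while True:
        slot = table[i]
        if slot is None:                    # truly empty: end of probe chain
            table[i if freeslot is None else freeslot] = key
            return
        if slot is DUMMY:
            if freeslot is None:
                freeslot = i                # remember the first dummy
        elif slot == key:
            return                          # already present
        i = (i + 1) & (SIZE - 1)            # linear probe

def discard(table, key):
    """Delete by replacing the entry with a dummy (the slot cannot simply
    be emptied without breaking other keys' probe chains)."""
    i = hash(key) & (SIZE - 1)
    while True:
        slot = table[i]
        if slot is None:
            return                          # not present
        if slot is not DUMMY and slot == key:
            table[i] = DUMMY                # deletion leaves a dummy behind
            return
        i = (i + 1) & (SIZE - 1)
```

For example, inserting 1 and 9 (which collide in an 8-slot table), deleting 1, and then inserting 17 puts 17 into the dummy slot that 1 left behind -- that is the "happens to be in a slot that we want to use again" case.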

This is a preliminary draft of an idea to only use resizing for dummy reclamation.  It proposes eliminating the freeslot reuse step in set_add_entry().  One advantage is that the patch nicely simplifies the code in set_add_entry and makes its inner-loop tighter (eliminating one memory access out of the two on each iteration).  With the advent of linear probing and the 60% set load factor, set resizing has become very cheap.
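In a toy Python model (a hypothetical sketch, not the actual patch), the proposed insert would simply probe past dummies, so the inner loop only has to distinguish empty from occupied:

```python
# Toy model of the proposed set_add_entry() (hypothetical sketch, not the
# real C code).  None marks an empty slot; DUMMY marks a deleted entry.
DUMMY = "<dummy>"
SIZE = 8                                    # power of two; mask = SIZE - 1

def add(table, key):
    """Proposed insert: no freeslot tracking.  Dummies are skipped like
    any other occupied slot, and only a truly empty slot ends the probe,
    so the inner loop performs one fewer check per slot."""
    i = hash(key) & (SIZE - 1)
    while True:
        slot = table[i]
        if slot is None:                    # truly empty: claim it
            table[i] = key
            return
        if slot is not DUMMY and slot == key:
            return                          # already present
        i = (i + 1) & (SIZE - 1)            # linear probe
```

Any dummy encountered on the way stays in place until the next resize reclaims it.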

One issue with the current design is that while a few dummies get reclaimed early, others are left alive longer, causing more collisions during lookups.  Reusing a few dummies also defers the resizing step that would actually clear out all of the dummies.

Another thought is that the proposal gives a better separation of concerns.  Insertion focuses on adding entries without making inner loop checks for an uncommon case (the relevant note in dictobject.c says this is uncommon by a factor of hundreds).  The patch leaves the dummy cleanup to the fast, efficient resizing step.

Some areas for exploration:

* We know that the benefit to set_add_entry() will reduce the cost of building a set (which is the common case).  It simply does less work by using less code overall and by having less code in the inner loop (see the disassembly below (1) to see that we're now down to one memory access per iteration).  That said, it would be nice to quantify whether this is a small improvement or a big one (I expect the answer will be different for Clang vs GCC vs MSVC, for 32-bit builds versus 64-bit builds, and for ARM versus x86, etc).

* It would also be nice to see whether there is an observable benefit or detriment in the uncommon case of building a set and then cycling back and forth between removing elements and adding new ones.

In any case, the simplification of the code in set_add_entry() will be nice and it would be good to have a separation of concerns with only set_resize being responsible for clearing out dummies in a single high-speed pass.
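In a toy Python model (a hypothetical sketch; the real work is done by the C resize routine), that single high-speed pass might look like this:

```python
# Toy model of the resize pass (hypothetical sketch, not the real C code).
# None marks an empty slot; DUMMY marks a deleted entry.
DUMMY = "<dummy>"

def resize(table, new_size):
    """Rebuild into a fresh table, reinserting only live keys.  Every
    dummy is dropped in this one pass, so inserts never need to do any
    per-slot dummy bookkeeping of their own."""
    new_table = [None] * new_size
    for key in table:
        if key is None or key is DUMMY:
            continue                        # dummies die here
        i = hash(key) & (new_size - 1)
        while new_table[i] is not None:     # a fresh table has no dummies
            i = (i + 1) & (new_size - 1)
        new_table[i] = key
    return new_table
```

Because the destination table starts out dummy-free, the reinsertion probe needs only the empty/occupied test.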

(1) Inner-loop for the new set_add_entry() featuring only one memory access per iteration:

    L383:
        cmpq    %r9, %rbx       ; entry < start + LINEAR_PROBES
        je  L406                
    L388:
        addq    $16, %rbx       ; entry++ 
        movq    8(%rbx), %rax   ; load entry->hash from memory
        testq   %rax, %rax      ; entry->hash == 0
        jne L382                ; until zero found skip next test  
        cmpq    $0, (%rbx)      ; entry->key == NULL
        je  L374                ; return entry
    L382:
        cmpq    %rax, %r15      ; entry->hash == hash
        jne L383                ; loop back
    <... proceed to equality test >