Message 78910 - Python tracker

Author	alexandre.vassalotti
Recipients	alexandre.vassalotti, blaisorblade, christian.heimes, lemburg, pitrou, rhettinger, skip.montanaro
Date	2009-01-03.00:45:25
Message-id	<1230943529.41.0.969700026517.issue4753@psf.upfronthosting.co.za>
In-reply-to

Content

The patch makes a huge difference on 64-bit Linux. I get a 20% speed-up
and the lowest run time so far. That is quite impressive!

At first glance, it seems the extra registers of the x86-64 architecture
permit GCC to avoid spilling registers onto the stack (see the assembly
below). However, I don't know why the speed-up due to the patch is much
more significant on x86-64 than on x86.

This is the x86 assembly generated by GCC 4.3 (annotated and
slightly edited for readability):

    movl    -440(%ebp), %eax  # tmp = next_instr
    movl    $145, %esi        # opcode = LIST_APPEND
    movl    8(%ebp), %ecx     # f
    subl    -408(%ebp), %eax  # tmp -= first_instr
    movl    %eax, 60(%ecx)    # f->f_lasti = tmp
    movl    -440(%ebp), %ebx  # next_instr
    movzbl  (%ebx), %eax      # tmp = *next_instr
    addl    $1, %ebx          # next_instr++
    movl    %ebx, -440(%ebp)  # next_instr
    movl    opcode_targets(,%eax,4), %eax  # tmp = opcode_targets[tmp]
    jmp     *%eax             # goto *tmp


And this is the x86-64 assembly, also generated by GCC 4.3:

    movl    %r15d, %eax      # tmp = next_instr
    subl    76(%rsp), %eax   # tmp -= first_instr
    movl    $145, %ebp       # opcode = LIST_APPEND
    movl    %eax, 120(%r14)  # f->f_lasti = tmp
    movzbl  (%r15), %eax     # tmp = *next_instr
    addq    $1, %r15         # next_instr++
    movq    opcode_targets(,%rax,8), %rax  # tmp = opcode_targets[tmp]
    jmp     *%rax            # goto *tmp


Both assembly listings above are equivalent to the following C code:

    opcode = LIST_APPEND;
    f->f_lasti = ((int)(next_instr - first_instr));
    goto *opcode_targets[*next_instr++];

On the register-starved x86 architecture, the assembly performs four
stack loads and one stack store. On the x86-64 architecture, most
variables are kept in registers, so only a single stack access remains
(the load of first_instr). And from what I saw in the assembly, the
traditional switch dispatch makes little use of the extra registers,
especially with the opcode prediction macros, which avoid manipulating
f->f_lasti.

That said, I am glad to hear the patch makes Python on PowerPC faster,
because this supports the hypothesis that extra registers are better
used with indirect threading (PowerPC has 32 general-purpose registers).
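
For readers who have not seen the dispatch technique discussed here, the following is a minimal sketch of the `goto *opcode_targets[*next_instr++]` pattern, not the patch itself. It relies on the "labels as values" extension of GCC (also accepted by Clang). The toy stack machine, its opcodes (`OP_PUSH1`, `OP_ADD`, `OP_DOUBLE`, `OP_HALT`) and the function name `run_threaded` are all made up for illustration; only the names `opcode_targets` and `next_instr` are borrowed from the listings above.

```c
#include <assert.h>

/* Hypothetical opcodes for a toy stack machine (not CPython's opcodes). */
enum { OP_PUSH1, OP_ADD, OP_DOUBLE, OP_HALT, OP_COUNT };

/* Interpret `code` until OP_HALT and return the top of the stack.
 * Each opcode handler ends by jumping directly to the next handler
 * through a table of label addresses (GCC "labels as values"). */
int run_threaded(const unsigned char *code)
{
    /* Table mapping each opcode to the address of its handler label. */
    static void *const opcode_targets[OP_COUNT] = {
        [OP_PUSH1]  = &&target_push1,
        [OP_ADD]    = &&target_add,
        [OP_DOUBLE] = &&target_double,
        [OP_HALT]   = &&target_halt,
    };
    const unsigned char *next_instr = code;
    int stack[16];
    int sp = 0;  /* number of values currently on the stack */

/* Fetch the next opcode, advance the pointer, jump to its handler. */
#define DISPATCH() goto *opcode_targets[*next_instr++]

    DISPATCH();

target_push1:
    assert(sp < 16);
    stack[sp++] = 1;
    DISPATCH();

target_add:
    assert(sp >= 2);
    sp--;
    stack[sp - 1] += stack[sp];
    DISPATCH();

target_double:
    assert(sp >= 1);
    stack[sp - 1] *= 2;
    DISPATCH();

target_halt:
    return sp > 0 ? stack[sp - 1] : 0;

#undef DISPATCH
}
```

Calling `run_threaded` on the byte sequence `OP_PUSH1, OP_PUSH1, OP_ADD, OP_DOUBLE, OP_HALT` evaluates (1 + 1) * 2. Because every handler ends with its own indirect jump, there is no shared switch statement that all opcodes return to; how much that helps in practice depends on the compiler and the CPU, as the measurements above suggest.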
