classification
Title: Speed up list.__eq__ by about 6%
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Dennis Sweeney, corona10, rhettinger, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2020-02-24 04:56 by Dennis Sweeney, last changed 2020-03-02 17:39 by vstinner. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 18638 merged Dennis Sweeney, 2020-02-24 04:57
Messages (12)
msg362566 - (view) Author: Dennis Sweeney (Dennis Sweeney) * Date: 2020-02-24 04:56
The following tiny change:

diff --git a/Objects/listobject.c b/Objects/listobject.c
index 3c39c6444b..3ac03b71d0 100644
--- a/Objects/listobject.c
+++ b/Objects/listobject.c
@@ -2643,8 +2643,7 @@ list_richcompare(PyObject *v, PyObject *w, int op)

         Py_INCREF(vitem);
         Py_INCREF(witem);
-        int k = PyObject_RichCompareBool(vl->ob_item[i],
-                                         wl->ob_item[i], Py_EQ);
+        int k = PyObject_RichCompareBool(vitem, witem, Py_EQ);
         Py_DECREF(vitem);
         Py_DECREF(witem);
         if (k < 0)

Creates the following performance improvement:

Before:
> .\python.bat -m timeit -s "A = list(range(10**7)); B = list(range(10**7))" "A==B"
2 loops, best of 5: 134 msec per loop
> .\python.bat -m timeit -s "A = list(range(10**7)); B = list(range(10**7))" "A==B"
2 loops, best of 5: 134 msec per loop

After:
> .\python.bat -m timeit -s "A = list(range(10**7)); B = list(range(10**7))" "A==B"
2 loops, best of 5: 126 msec per loop
> .\python.bat -m timeit -s "A = list(range(10**7)); B = list(range(10**7))" "A==B"
2 loops, best of 5: 126 msec per loop
msg362646 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2020-02-25 15:08
IMHO, I can not see a noticeable performance improvement.
I think that the modern compiler will optimize the reused variable. ;)
The below result might be noise.


[master]
./python.exe -m pyperf timeit "A = list(range(10**3)); B = list(range(10**3))" "A==B"
.....................
Mean +- std dev: 415 us +- 6 us

[PR 18638]
./python.exe -m pyperf timeit "A = list(range(10**3)); B = list(range(10**3))" "A==B"
.....................
Mean +- std dev: 433 us +- 20 us
msg362654 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-02-25 18:16
I have doubts that such small change can lead to significant speed up. It is likely a compiler glitch.

But this change makes the code clearer. If you remove a NEWS entry with a doubtful promise I will accept it.
msg362665 - (view) Author: Dennis Sweeney (Dennis Sweeney) * Date: 2020-02-25 20:30
I made the requested changes to reflect that this is for code cleanliness rather than strictly for performance.

However, it appears that Visual Studio on Windows 10 was not doing the optimization one might expect. In particular, here is the disassembly before this PR:

===========================================
        PyObject *vitem = vl->ob_item[i];
78E20C49  mov         eax,dword ptr [edx+0Ch]  
        PyObject *vitem = vl->ob_item[i];
78E20C4C  mov         ecx,dword ptr [eax+edi*4]  
        PyObject *witem = wl->ob_item[i];
78E20C4F  mov         eax,dword ptr [esi+0Ch]  
78E20C52  mov         dword ptr [vitem],ecx  
78E20C55  mov         esi,dword ptr [eax+edi*4]  
78E20C58  mov         dword ptr [witem],esi  
        if (vitem == witem) {
78E20C5B  cmp         ecx,esi  
78E20C5D  je          list_richcompare+12Bh (78E20CFBh)  
            continue;
        }

        Py_INCREF(vitem);
78E20C63  inc         dword ptr [ecx]  
        Py_INCREF(witem);
78E20C65  inc         dword ptr [esi]  
        int k = PyObject_RichCompareBool(vl->ob_item[i], 
78E20C67  mov         eax,dword ptr [w]  
78E20C6A  mov         eax,dword ptr [eax+0Ch]  
78E20C6D  mov         ebx,dword ptr [eax+edi*4]  
78E20C70  mov         eax,dword ptr [edx+0Ch]  
78E20C73  mov         eax,dword ptr [eax+edi*4]  
78E20C76  cmp         eax,ebx  
78E20C78  jne         list_richcompare+0B1h (78E20C81h)  
78E20C7A  mov         ebx,1  
78E20C7F  jmp         list_richcompare+100h (78E20CD0h)  
78E20C81  push        2  
78E20C83  push        ebx  
78E20C84  push        eax  
78E20C85  call        PyObject_RichCompare (78E35120h)  
78E20C8A  mov         esi,eax  
78E20C8C  add         esp,0Ch  
78E20C8F  test        esi,esi  
78E20C91  jne         list_richcompare+0C8h (78E20C98h)  
78E20C93  or          ebx,0FFFFFFFFh  
78E20C96  jmp         list_richcompare+0FAh (78E20CCAh)  
78E20C98  cmp         dword ptr [esi+4],offset PyBool_Type (790A3420h)  
===========================================

And after this PR:
===========================================
        PyObject *vitem = vl->ob_item[i];
795E0C4A  mov         eax,dword ptr [edx+0Ch]  
        PyObject *vitem = vl->ob_item[i];
795E0C4D  mov         ebx,dword ptr [eax+ecx*4]  
        PyObject *witem = wl->ob_item[i];
795E0C50  mov         eax,dword ptr [edi+0Ch]  
795E0C53  mov         edi,dword ptr [eax+ecx*4]  
        if (vitem == witem) {
795E0C56  cmp         ebx,edi  
795E0C58  je          list_richcompare+109h (795E0CD9h)  
            continue;
        }

        Py_INCREF(vitem);
795E0C5A  inc         dword ptr [ebx]  
        Py_INCREF(witem);
795E0C5C  inc         dword ptr [edi]  
        int k = PyObject_RichCompareBool(vitem, witem, Py_EQ);
795E0C5E  push        2  
795E0C60  push        edi  
795E0C61  push        ebx  
795E0C62  call        PyObject_RichCompare (795F5120h)  
795E0C67  mov         esi,eax  
795E0C69  add         esp,0Ch  
795E0C6C  test        esi,esi  
795E0C6E  jne         list_richcompare+0A5h (795E0C75h)  
795E0C70  or          esi,0FFFFFFFFh  
795E0C73  jmp         list_richcompare+0DBh (795E0CABh)  
795E0C75  cmp         dword ptr [esi+4],offset PyBool_Type (79863420h)  
795E0C7C  jne         list_richcompare+0BBh (795E0C8Bh)  
795E0C7E  xor         eax,eax  
795E0C80  cmp         esi,offset _Py_TrueStruct (798633FCh)  
795E0C86  sete        al  
795E0C89  jmp         list_richcompare+0C4h (795E0C94h)  
795E0C8B  push        esi  
795E0C8C  call        PyObject_IsTrue (795F62F0h)  
795E0C91  add         esp,4  
795E0C94  add         dword ptr [esi],0FFFFFFFFh  
795E0C97  mov         dword ptr [k],eax  
795E0C9A  jne         list_richcompare+0D8h (795E0CA8h)  
===========================================
msg362676 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2020-02-26 02:07
>  it appears that Visual Studio on Windows 10 was not doing the optimization one might expect.

Hmm, Is this build on release mode?
msg362678 - (view) Author: Dennis Sweeney (Dennis Sweeney) * Date: 2020-02-26 02:50
> Hmm, Is this build on release mode?

Yes--in debug mode, each Py_INCREF is these 8 instructions:

========================
78BD071E  mov         ecx,dword ptr [_Py_RefTotal (79039700h)]  
78BD0724  add         ecx,1  
78BD0727  mov         dword ptr [_Py_RefTotal (79039700h)],ecx  
78BD072D  mov         edx,dword ptr [ebp-10h]  
78BD0730  mov         eax,dword ptr [edx]  
78BD0732  add         eax,1  
78BD0735  mov         ecx,dword ptr [ebp-10h]  
78BD0738  mov         dword ptr [ecx],eax  
========================

instead of just the one `inc` in release mode.
msg362679 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2020-02-26 03:09
Debug mode is not meaningful.
Visual Studio will optimize fully on release mode.
msg362680 - (view) Author: Dennis Sweeney (Dennis Sweeney) * Date: 2020-02-26 03:57
> Debug mode is not meaningful.
> Visual Studio will optimize fully on release mode.

Sorry if I wasn't clear--the original assembly difference I posted in (https://bugs.python.org/msg362665) was indeed using the "release" build configuration. My last comment was trying to demonstrate this fact by showing that the assembly output would have looked different if I had selected the "debug" build configuration. 

Is there something I'm missing, or was Visual Studio just missing an opportunity to optimize?
msg362681 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-02-26 04:42
Speed-up or not, I concur that the new code it clearer.
msg362683 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-02-26 07:00
New changeset be7ead62db9a1db3e2cd997b0beffd4480e51f5c by sweeneyde in branch 'master':
bpo-39737: Remove code repitition in list_richcompare (GH-18638)
https://github.com/python/cpython/commit/be7ead62db9a1db3e2cd997b0beffd4480e51f5c
msg362684 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-02-26 07:01
Thank you for your contribution Dennis.
msg363193 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-02 17:39
I suggest to use LTO + PGO optimizations when benchmarking Python:
https://pyperformance.readthedocs.io/usage.html#how-to-get-stable-benchmarks

Benchmarking is hard :-/
History
Date User Action Args
2020-03-02 17:39:23vstinnersetmessages: + msg363193
2020-02-26 07:01:45serhiy.storchakasetstatus: open -> closed
resolution: fixed
messages: + msg362684

stage: patch review -> resolved
2020-02-26 07:00:39serhiy.storchakasetmessages: + msg362683
2020-02-26 04:42:48rhettingersetnosy: + rhettinger
messages: + msg362681
2020-02-26 03:57:32Dennis Sweeneysetmessages: + msg362680
2020-02-26 03:09:47corona10setmessages: + msg362679
2020-02-26 02:50:23Dennis Sweeneysetmessages: + msg362678
2020-02-26 02:07:31corona10setmessages: + msg362676
2020-02-25 20:30:06Dennis Sweeneysetmessages: + msg362665
2020-02-25 18:16:51serhiy.storchakasetmessages: + msg362654
2020-02-25 15:08:23corona10setmessages: + msg362646
2020-02-24 16:06:00SilentGhostsetnosy: + vstinner, serhiy.storchaka, corona10
2020-02-24 04:57:41Dennis Sweeneysetkeywords: + patch
stage: patch review
pull_requests: + pull_request18004
2020-02-24 04:56:41Dennis Sweeneycreate