Message 186010 - Python tracker


Author	vstinner
Recipients	Neil.Hodgson, ethan.furman, ezio.melotti, georg.brandl, pitrou, serhiy.storchaka, vstinner
Date	2013-04-04.07:44:46
Message-id	<CAMpsgwbTF6WC0j94=zViWro5s9ezLqdF8Rfis3-PWNUmTefmaw@mail.gmail.com>
In-reply-to	<1365029696.53.0.503821886831.issue17615@psf.upfronthosting.co.za>

Content

"For 32-bit Windows, the code generated for unicode_compare is quite
slow. There are either 1 or 2 kind checks in each call to
PyUnicode_READ (...)"

Yes, PyUnicode_READ() *is* slow. It should not be used in a loop. And
unicode_compare() uses PyUnicode_READ() in a loop.
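
To make the cost concrete, here is a rough standalone sketch of a
kind-dispatching read used inside a comparison loop. It is not CPython's
actual code: READ_ANY, the KIND_* constants and compare_generic() are
hypothetical stand-ins, and the fixed-width integer types stand in for
Py_UCS1/Py_UCS2/Py_UCS4. The point is only that the kind is tested again
for every character, which is the cmp/jne chain visible in the quoted
assembler.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the PyUnicode_*_KIND constants. */
enum { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 4 };

/* Simplified analogue of a kind-dispatching read: one or two
   comparisons on `kind` are evaluated for every single character. */
#define READ_ANY(kind, data, i) \
    ((uint32_t)((kind) == KIND_1BYTE ? ((const uint8_t *)(data))[(i)] : \
                (kind) == KIND_2BYTE ? ((const uint16_t *)(data))[(i)] : \
                                       ((const uint32_t *)(data))[(i)]))

/* Generic comparison returning -1, 0 or 1. The kind checks stay inside
   the loop unless the compiler decides to hoist them. */
static int
compare_generic(int kind1, const void *data1, size_t len1,
                int kind2, const void *data2, size_t len2)
{
    size_t n = len1 < len2 ? len1 : len2;
    for (size_t i = 0; i < n; i++) {
        uint32_t c1 = READ_ANY(kind1, data1, i);
        uint32_t c2 = READ_ANY(kind2, data2, i);
        if (c1 != c2)
            return c1 < c2 ? -1 : 1;
    }
    /* Common prefix is equal: the shorter string sorts first. */
    if (len1 == len2)
        return 0;
    return len1 < len2 ? -1 : 1;
}
```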

An improvement would be to write a specialized version for each
combination of Unicode kinds:
(UCS1, UCS2), (UCS1, UCS4),
(UCS2, UCS1), (UCS2, UCS2), (UCS2, UCS4),
(UCS4, UCS1), (UCS4, UCS2), (UCS4, UCS4)
# (UCS1, UCS1) uses memcmp()

But I am not convinced that the gain would be visible, and I don't
know how to factor the code. We should probably use a huge macro.
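
As a rough idea of what the macro approach might look like, here is a
standalone sketch, not a patch. DEFINE_COMPARE, the cmp_*_* functions,
PAIR, the KIND_* constants and compare_specialized() are all made-up
names, and fixed-width integer types stand in for
Py_UCS1/Py_UCS2/Py_UCS4. The macro generates one typed loop per pair of
kinds, so the kind is checked once per call instead of once per
character.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the PyUnicode_*_KIND constants. */
enum { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 4 };

/* Generate one comparison loop for a fixed (T1, T2) pair of character
   types; no kind check remains inside the loop. */
#define DEFINE_COMPARE(NAME, T1, T2)                          \
    static int NAME(const T1 *s1, size_t len1,                \
                    const T2 *s2, size_t len2)                \
    {                                                         \
        size_t n = len1 < len2 ? len1 : len2;                 \
        for (size_t i = 0; i < n; i++) {                      \
            uint32_t c1 = s1[i], c2 = s2[i];                  \
            if (c1 != c2)                                     \
                return c1 < c2 ? -1 : 1;                      \
        }                                                     \
        return (len1 > len2) - (len1 < len2);                 \
    }

DEFINE_COMPARE(cmp_1_2, uint8_t,  uint16_t)
DEFINE_COMPARE(cmp_1_4, uint8_t,  uint32_t)
DEFINE_COMPARE(cmp_2_1, uint16_t, uint8_t)
DEFINE_COMPARE(cmp_2_2, uint16_t, uint16_t)
DEFINE_COMPARE(cmp_2_4, uint16_t, uint32_t)
DEFINE_COMPARE(cmp_4_1, uint32_t, uint8_t)
DEFINE_COMPARE(cmp_4_2, uint32_t, uint16_t)
DEFINE_COMPARE(cmp_4_4, uint32_t, uint32_t)

/* Kinds are 1, 2 or 4, so k1 * 8 + k2 is unique for each pair. */
#define PAIR(k1, k2) ((k1) * 8 + (k2))

/* Dispatch once on the pair of kinds; returns -1, 0 or 1. */
static int
compare_specialized(int kind1, const void *d1, size_t len1,
                    int kind2, const void *d2, size_t len2)
{
    switch (PAIR(kind1, kind2)) {
    case PAIR(KIND_1BYTE, KIND_1BYTE): {
        /* (UCS1, UCS1): memcmp() on the common prefix. */
        size_t n = len1 < len2 ? len1 : len2;
        int r = memcmp(d1, d2, n);
        if (r != 0)
            return r < 0 ? -1 : 1;
        return (len1 > len2) - (len1 < len2);
    }
    case PAIR(KIND_1BYTE, KIND_2BYTE): return cmp_1_2(d1, len1, d2, len2);
    case PAIR(KIND_1BYTE, KIND_4BYTE): return cmp_1_4(d1, len1, d2, len2);
    case PAIR(KIND_2BYTE, KIND_1BYTE): return cmp_2_1(d1, len1, d2, len2);
    case PAIR(KIND_2BYTE, KIND_2BYTE): return cmp_2_2(d1, len1, d2, len2);
    case PAIR(KIND_2BYTE, KIND_4BYTE): return cmp_2_4(d1, len1, d2, len2);
    case PAIR(KIND_4BYTE, KIND_1BYTE): return cmp_4_1(d1, len1, d2, len2);
    case PAIR(KIND_4BYTE, KIND_2BYTE): return cmp_4_2(d1, len1, d2, len2);
    case PAIR(KIND_4BYTE, KIND_4BYTE): return cmp_4_4(d1, len1, d2, len2);
    default:
        /* Invalid kind: this sketch simply reports "equal". */
        return 0;
    }
}
```

Whether this is measurably faster would still need a benchmark; the
sketch only shows one way to factor the eight loops.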

2013/4/4 Neil Hodgson <report@bugs.python.org>:
>
> Neil Hodgson added the comment:
>
> For 32-bit Windows, the code generated for unicode_compare is quite slow.
>
>     There are either 1 or 2 kind checks in each call to PyUnicode_READ and 2 calls to PyUnicode_READ inside the loop. A compiler may decide to move the kind checks out of the loop and specialize the loop but MSVC 2010 appears to not do so. The assembler (32-bit build) for each PyUnicode_READ looks like
>
>     mov    ecx, DWORD PTR _kind1$[ebp]
>     cmp    ecx, 1
>     jne    SHORT $LN17@unicode_co@2
>     lea    ecx, DWORD PTR [ebx+eax]
>     movzx    edx, BYTE PTR [ecx+edx]
>     jmp    SHORT $LN16@unicode_co@2
> $LN17@unicode_co@2:
>     cmp    ecx, 2
>     jne    SHORT $LN15@unicode_co@2
>     movzx    edx, WORD PTR [ebx+edi]
>     jmp    SHORT $LN16@unicode_co@2
> $LN15@unicode_co@2:
>     mov    edx, DWORD PTR [ebx+esi]
> $LN16@unicode_co@2:
>
>    The kind1/kind2 variables aren't even going into registers and at least one test+branch and a jump are executed for every character. Two tests for 2 and 4 byte kinds. len1 and len2 don't get to go into registers either.
>
>    My system isn't set up for 64-bit MSVC 2010 but looking at the code from 64-bit MSVC 2012 shows that all the variables have been moved into registers but the kind checking is still inside the loop. This accounts for better results with 64-bit Python 3.3 on Windows but isn't as good as Unix or Python 3.2.
>
> ; 10431:         c1 = PyUnicode_READ(kind1, data1, i);
>
>         cmp     rsi, 1
>         jne     SHORT $LN17@unicode_co
>         lea     rax, QWORD PTR [r9+rcx]
>         movzx   r8d, BYTE PTR [rax+rbx]
>         jmp     SHORT $LN16@unicode_co
> $LN17@unicode_co:
>         cmp     rsi, 2
>         jne     SHORT $LN15@unicode_co
>         movzx   r8d, WORD PTR [r9+r11]
>         jmp     SHORT $LN16@unicode_co
> $LN15@unicode_co:
>         mov     r8d, DWORD PTR [r9+r10]
> $LN16@unicode_co:
>
>    Attached the 32-bit assembler listing.
>
> ----------
> Added file: http://bugs.python.org/file29673/unicode_compare.asm
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue17615>
> _______________________________________

History
Date	User	Action	Args
2013-04-04 07:44:46	vstinner	set	recipients: + vstinner, georg.brandl, pitrou, ezio.melotti, ethan.furman, serhiy.storchaka, Neil.Hodgson
2013-04-04 07:44:46	vstinner	link	issue17615 messages
2013-04-04 07:44:46	vstinner	create