This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: speed-up conversion between unicode widths
Type: performance
Stage: resolved
Components: Interpreter Core
Versions: Python 3.3

process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: lemburg, loewis, meador.inge, pitrou, python-dev
Priority: normal
Keywords: patch

Created on 2011-10-08 22:18 by pitrou, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name                   Uploaded
unroll_convert_bytes.patch  pitrou, 2011-10-08 22:18
unroll.c                    loewis, 2011-10-09 03:06
Messages (8)
msg145192 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-08 22:18
This patch speeds up _PyUnicode_CONVERT_BYTES by unrolling its loop.

Example micro-benchmark:

./python -m timeit -s "a='x'*10000;b='\u0102'*1000;c='\U00100000'" "a+b+c"

-> before:
100000 loops, best of 3: 14.9 usec per loop
-> after:
100000 loops, best of 3: 9.19 usec per loop
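For readers without the patch at hand, the following is a minimal sketch of the kind of unrolling being discussed. The function names are made up, and the real _PyUnicode_CONVERT_BYTES is a type-generic macro in Objects/unicodeobject.c; this models only the 1-byte-to-2-byte widening case:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the widening copy that _PyUnicode_CONVERT_BYTES performs,
   specialized to one instantiation: widening 1-byte code points into a
   2-byte buffer.  The real macro is generic over the from/to types. */
static void
convert_1_to_2(const uint8_t *in, const uint8_t *end, uint16_t *out)
{
    while (in < end)
        *out++ = *in++;
}

/* Hand-unrolled variant in the spirit of the patch: four elements per
   iteration, with a scalar loop for the remaining tail. */
static void
convert_1_to_2_unrolled(const uint8_t *in, const uint8_t *end, uint16_t *out)
{
    while (end - in >= 4) {
        out[0] = in[0];
        out[1] = in[1];
        out[2] = in[2];
        out[3] = in[3];
        in += 4;
        out += 4;
    }
    while (in < end)            /* tail: fewer than 4 elements left */
        *out++ = *in++;
}

/* Returns 1 if both variants produce identical output on a sample
   buffer whose length is deliberately not a multiple of 4. */
static int
variants_agree(void)
{
    uint8_t in[11];
    uint16_t a[11], b[11];
    for (size_t i = 0; i < 11; i++)
        in[i] = (uint8_t)(i * 7 + 1);
    convert_1_to_2(in, in + 11, a);
    convert_1_to_2_unrolled(in, in + 11, b);
    for (size_t i = 0; i < 11; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Both variants are observably equivalent; the speedup in the benchmark above comes purely from the unrolled loop doing less branch and induction-variable work per element.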
msg145193 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-10-08 22:29
Antoine Pitrou wrote:
> 
> New submission from Antoine Pitrou <pitrou@free.fr>:
> 
> This patch speeds up _PyUnicode_CONVERT_BYTES by unrolling its loop.
> 
> Example micro-benchmark:
> 
> ./python -m timeit -s "a='x'*10000;b='\u0102'*1000;c='\U00100000'" "a+b+c"
> 
> -> before:
> 100000 loops, best of 3: 14.9 usec per loop
> -> after:
> 100000 loops, best of 3: 9.19 usec per loop

Before going further with this, I'd suggest you have a look at your
compiler settings. Such optimizations are normally performed by the
compiler and don't need to be implemented in C, making maintenance
harder.

The fact that Windows doesn't exhibit the same performance difference
suggests that the optimizer is not using the same level or feature
set as on Linux. MSVC is at least as good at optimizing code as gcc,
often better.

I tested using memchr() when writing those "naive" loops. It turned
out that using memchr() was slower than using the direct loops. memchr()
is inlined by the compiler just like the direct loop and the generated
code for the direct version is often easier to optimize for the compiler
than the memchr() one, since it receives more knowledge about the used
data types.
msg145194 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-08 22:34
> Before going further with this, I'd suggest you have a look at your
> compiler settings.

They are set by the configure script:

gcc -pthread -c -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall
-Wstrict-prototypes    -I. -I./Include    -DPy_BUILD_CORE -o
Objects/unicodeobject.o Objects/unicodeobject.c

> Such optimizations are normally performed by the
> compiler and don't need to be implemented in C, making maintenance
> harder.

The fact that the glibc includes such optimization (in much more
sophisticated form) suggests to me that many compilers don't perform
these optimizations automatically.

> I tested using memchr() when writing those "naive" loops.

memchr() is mentioned in another issue, #13134.

> memchr()
> is inlined by the compiler just like the direct loop

I don't think so. If you look at the glibc's memchr() implementation,
it's a sophisticated routine, not a trivial loop. Perhaps you're
thinking about memcpy().

> and the generated
> code for the direct version is often easier to optimize for the compiler
> than the memchr() one, since it receives more knowledge about the used
> data types.

?? Data types are fixed in the memchr() definition, there's no knowledge
to be gained by inlining.
msg145200 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2011-10-09 00:49
On Sat, Oct 8, 2011 at 5:34 PM, Antoine Pitrou <report@bugs.python.org> wrote:

> Antoine Pitrou <pitrou@free.fr> added the comment:
>
>> Before going further with this, I'd suggest you have a look at your
>> compiler settings.
>
> They are set by the configure script:
>
> gcc -pthread -c -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall
> -Wstrict-prototypes    -I. -I./Include    -DPy_BUILD_CORE -o
> Objects/unicodeobject.o Objects/unicodeobject.c
>
>> Such optimizations are normally performed by the
>> compiler and don't need to be implemented in C, making maintenance
>> harder.
>
> The fact that the glibc includes such optimization (in much more
> sophisticated form) suggests to me that many compilers don't perform
> these optimizations automatically.

I agree.  This is more of an optimized-runtime-library problem than a
code-optimization problem.

>> I tested using memchr() when writing those "naive" loops.
>
> memchr() is mentioned in another issue, #13134.

Yeah, this conversation is really more relevant to issue13134, but I will
reply to these here anyway ....

>> memchr()
>> is inlined by the compiler just like the direct loop
>
> I don't think so. If you look at the glibc's memchr() implementation,
> it's a sophisticated routine, not a trivial loop. Perhaps you're
> thinking about memcpy().

Without link-time optimization enabled, I doubt the toolchain can
"inline" 'memchr' in the traditional sense (i.e. inserting the body of
the routine inline).  Even if it could, the inline heuristics would most
likely choose not to.  I don't think we use LTO with GCC, but I think we
might with VC++.

GCC does something else, called builtin folding, that is more likely.
For example, 'memchr ("bca", 'c', 3)' gets replaced with instructions to
compute a pointer index into "bca".  This won't happen in this case
because all of the 'memchr' arguments are variable.

>> and the generated
>> code for the direct version is often easier to optimize for the compiler
>> than the memchr() one, since it receives more knowledge about the used
>> data types.
>
> ?? Data types are fixed in the memchr() definition, there's no knowledge
> to be gained by inlining.

I think what Marc-Andre is alluding to is that the first parameter of
'memchr' is 'void *', which could (in theory) limit optimization
opportunities.  Whereas if it knew that the data being searched is a
'char *' or something, it could take advantage of that.
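To make the type argument concrete, here is a small sketch ('find_byte' is a made-up name) of a direct typed loop next to the memchr() call it would replace. With the element type and the trivial loop body visible at the call site, the compiler is free to unroll or vectorize on its own, without any interprocedural knowledge:

```c
#include <stddef.h>
#include <string.h>

/* Direct typed loop: the compiler sees that the elements are plain
   chars and that the loop body is trivial, so it can unroll or
   vectorize it locally.  memchr(), by contrast, takes void * and lives
   in the C library unless the compiler recognizes and folds the call. */
static const char *
find_byte(const char *p, char c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (p[i] == c)
            return p + i;
    return NULL;                /* not found, matching memchr's NULL */
}
```

On any input, find_byte(p, c, n) returns the same pointer as memchr(p, c, n); the disagreement in this thread is only about which form the optimizer handles better, not about behavior.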
msg145205 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-10-09 03:06
Marc-Andre: gcc will normally not unroll loops, unless -funroll-loops is given on the command line. Then, it will unroll many loops, and do so with 8 iterations per outer loop. This typically causes significant code bloat, which is why unrolling is normally disabled and left to the programmer.

For those who want to experiment with this, I attach a C file with just the code in question. Compile this with your favorite compiler settings, and see what the compiler generates. clang, on an x64 system, compiles the original loop into


LBB0_2:                                 ## =>This Inner Loop Header: Depth=1
        movzbl  (%rdi), %eax
        movw    %ax, (%rdx)
        incq    %rdi
        addq    $2, %rdx
        decq    %rsi
        jne     LBB0_2

and the unrolled loop into

LBB1_2:                                 ## %.lr.ph6
                                        ## =>This Inner Loop Header: Depth=1
        movzbl  (%rdi,%rcx), %r8d
        movw    %r8w, (%rdx)
        movzbl  1(%rdi,%rcx), %r8d
        movw    %r8w, 2(%rdx)
        movzbl  2(%rdi,%rcx), %r8d
        movw    %r8w, 4(%rdx)
        movzbl  3(%rdi,%rcx), %r8d
        movw    %r8w, 6(%rdx)
        addq    $8, %rdx
        addq    $4, %rcx
        cmpq    %rax, %rcx
        jl      LBB1_2
msg145252 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-10-09 11:33
Antoine Pitrou wrote:
> 
>> I tested using memchr() when writing those "naive" loops.
> 
> memchr() is mentioned in another issue, #13134.

Looks like I posted the comment to the wrong ticket.
msg145360 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-10-11 19:07
New changeset 5b077c962a16 by Antoine Pitrou in branch 'default':
Issue #13136: speed up conversion between different character widths.
http://hg.python.org/cpython/rev/5b077c962a16
msg145361 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-11 19:08
Patch committed. It is of course still not as fast as memcpy, but it's a small step towards improving performance.
History
Date                 User         Action  Args
2022-04-11 14:57:22  admin        set     github: 57345
2011-10-11 19:08:08  pitrou       set     status: open -> closed
                                          resolution: fixed
                                          messages: + msg145361
                                          stage: patch review -> resolved
2011-10-11 19:07:13  python-dev   set     nosy: + python-dev
                                          messages: + msg145360
2011-10-09 11:33:30  lemburg      set     messages: + msg145252
2011-10-09 03:06:10  loewis       set     files: + unroll.c
                                          nosy: + loewis
                                          messages: + msg145205
2011-10-09 00:49:51  meador.inge  set     nosy: + meador.inge
                                          messages: + msg145200
2011-10-08 22:34:50  pitrou       set     messages: + msg145194
2011-10-08 22:29:12  lemburg      set     nosy: + lemburg
                                          messages: + msg145193
2011-10-08 22:18:30  pitrou       create