This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: speed up finding of one-character strings
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: lemburg, loewis, pitrou, python-dev, vstinner
Priority: normal Keywords: patch

Created on 2011-10-08 20:29 by pitrou, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
memchr.patch pitrou, 2011-10-08 20:29 review
memchr2.patch pitrou, 2011-10-08 20:55 review
Messages (11)
msg145186 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-08 20:29
In stringlib/fastsearch, instead of using our own loops, we can use memchr() (and, if available, memrchr()) when looking for one-character strings. memchr() is generally quite optimized; on this Linux/glibc machine, I get speedups of 5x-10x:

./python -m timeit -s "large='a'*10000+'b'" "large.find('b')"
-> before: 10.5 usec per loop
-> after: 0.889 usec per loop

./python -m timeit -s "large='b'+'a'*10000" "large.rfind('b')"
-> before: 7.06 usec per loop
-> after: 1.94 usec per loop
msg145187 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-08 20:39
With MSVC there seems to be no difference, meaning Windows probably has a naïve memchr() implementation.
msg145189 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-08 20:55
New patch with a couple of tests.
msg145251 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-10-09 11:23
[Posted the reply to the right ticket; see issue13136 for the original
 post to the wrong ticket]

Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>> Before going further with this, I'd suggest you have a look at your
>> compiler settings.
> 
> They are set by the configure script:
> 
> gcc -pthread -c -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall
> -Wstrict-prototypes    -I. -I./Include    -DPy_BUILD_CORE -o
> Objects/unicodeobject.o Objects/unicodeobject.c

Which gcc version are you using ?
Is it possible that you have -fno-builtin enabled ?

>> Such optimizations are normally performed by the
>> compiler and don't need to be implemented in C, making maintenance
>> harder.
> 
> The fact that the glibc includes such optimization (in much more
> sophisticated form) suggests to me that many compilers don't perform
> these optimizations automically.

When using gcc, the glibc functions are usually not used at all,
since gcc comes with a (rather large) set of builtins which are
inlined directly, if you have optimizations enabled and inlining
is found to be more efficient than calling the glibc function:

http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

glibc includes the optimized versions since it has to implement
C library (obviously) and for cases where inlining does not
happen.

>> I tested using memchr() when writing those "naive" loops.
> 
> memchr() is mentioned in another issue, #13134.
> 
>> memchr()
>> is inlined by the compiler just like the direct loop
> 
> I don't think so. If you look at the glibc's memchr() implementation,
> it's a sophisticated routine, not a trivial loop. Perhaps you're
> thinking about memcpy().

See http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html and the
assembler output. If it's not inlined, then something must be
preventing this and it would be good to find out why.

>> and the generated
>> code for the direct version is often easier to optimize for the compiler
>> than the memchr() one, since it receives more knowledge about the used
>> data types.
> 
> ?? Data types are fixed in the memchr() definition, there's no knowledge
> to be gained by inlining.

There is: the compiler will have alignement information available and
can also benefit from using registers instead of the stack, knowledge
about processor cache lines, etc. Such information is lost when calling
a function. The function call itself will also create some overhead.

BTW: You should not only test the optimization with long strings, but also
with short ones (e.g. 2-15 chars) - which is a much more common case
in practice.
msg145279 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-09 21:21
> >> Before going further with this, I'd suggest you have a look at your
> >> compiler settings.
> > 
> > They are set by the configure script:
> > 
> > gcc -pthread -c -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall
> > -Wstrict-prototypes    -I. -I./Include    -DPy_BUILD_CORE -o
> > Objects/unicodeobject.o Objects/unicodeobject.c
> 
> Which gcc version are you using ?

$ gcc -v
Utilisation des specs internes.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-mageia-linux-gnu/4.5.2/lto-wrapper
Target: x86_64-mageia-linux-gnu
Configuré avec: ../configure --prefix=/usr --libexecdir=/usr/lib
--with-slibdir=/lib64 --with-bugurl=http://bugs.mageia.org/
--mandir=/usr/share/man --infodir=/usr/share/info
--enable-checking=release --enable-languages=c,c
++,ada,fortran,objc,obj-c++,java --build=x86_64-mageia-linux-gnu
--host=x86_64-mageia-linux-gnu --with-cpu=generic --with-system-zlib
--enable-threads=posix --enable-shared --enable-objc-gc
--enable-long-long --enable-__cxa_atexit --disable-libunwind-exceptions
--enable-clocale=gnu --enable-java-awt=gtk
--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre
--with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-gtk-cairo
--disable-libjava-multilib --enable-ssp --disable-libssp
--disable-werror --with-ppl --with-cloog
--with-python-dir=/lib/python2.7/site-packages --enable-lto
Modèle de thread: posix
gcc version 4.5.2 (GCC) 

> Is it possible that you have -fno-builtin enabled ?

Why would it be enabled? It is not on the command line.

> >> Such optimizations are normally performed by the
> >> compiler and don't need to be implemented in C, making maintenance
> >> harder.
> > 
> > The fact that the glibc includes such optimization (in much more
> > sophisticated form) suggests to me that many compilers don't perform
> > these optimizations automically.
> 
> When using gcc, the glibc functions are usually not used at all,
> since gcc comes with a (rather large) set of builtins which are
> inlined directly, if you have optimizations enabled and inlining
> is found to be more efficient than calling the glibc function:

What would that change? Whether the optimized memchr() comes from gcc or
the glibc is not relevant here.

> There is: the compiler will have alignement information available and
> can also benefit from using registers instead of the stack, knowledge
> about processor cache lines, etc. Such information is lost when calling
> a function. The function call itself will also create some overhead.
> 
> BTW: You should not only test the optimization with long strings, but also
> with short ones (e.g. 2-15 chars) - which is a much more common case
> in practice.

With very short strings, the runtimes tend to be identical, which is
understandable.
msg145357 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-10-11 18:44
New changeset af4a89f4f466 by Antoine Pitrou in branch 'default':
Issue #13134: optimize finding single-character strings using memchr
http://hg.python.org/cpython/rev/af4a89f4f466
msg145359 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-11 18:45
I went ahead and committed.
msg145367 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-10-11 21:23
changeset:   72874:bfd3fcfb02f3
user:        Victor Stinner <victor.stinner@haypocalc.com>
date:        Tue Oct 11 23:22:22 2011 +0200
files:       Objects/stringlib/asciilib.h Objects/stringlib/fastsearch.h Objects/stringlib/stringdefs.h Objects/stringlib/ucs1lib.h Objects/stri
description:
Fix fastsearch for UCS2 and UCS4

 * If needle is 0, try (p[0] >> 16) & 0xff for UCS4
 * Disable fastsearch_memchr_1char() if needle is zero for UCS2 and UCS4
msg145451 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-10-13 12:22
I think the >1 character sizes are overly complex in this patch, and still memchr isn't typically used for them. So I suggest to simplify the code and restrict it to 1-byte chars only.
msg145454 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-13 14:21
> I think the >1 character sizes are overly complex in this patch, and
> still memchr isn't typically used for them. So I suggest to simplify
> the code and restrict it to 1-byte chars only.

I would rather propose to simplify the needle heuristic and only use it
when the lower byte is non-zero. A properly optimized memchr() (as in
the glibc / gcc) is definitely faster than our naïve loop.
msg145467 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-10-13 16:05
> I would rather propose to simplify the needle heuristic and only use it
> when the lower byte is non-zero. A properly optimized memchr() (as in
> the glibc / gcc) is definitely faster than our naïve loop.

That would be fine as well. Not sure if a heuristics would be needed in
this case at all: it's probably uncommon that you search for a single
character whose lower-half is 0 (most likely you are then searching for
the null character, and not, say, LATIN CAPITAL LETTER A WITH DOUBLE
GRAVE).

In any case, I still think that the heuristics (if any) needs to be
explained better, and needs some justification in the first place.
History
Date User Action Args
2022-04-11 14:57:22adminsetgithub: 57343
2011-10-13 16:05:32loewissetmessages: + msg145467
2011-10-13 14:21:12pitrousetmessages: + msg145454
2011-10-13 12:22:59loewissetmessages: + msg145451
2011-10-11 21:23:26vstinnersetmessages: + msg145367
2011-10-11 18:45:51pitrousetstatus: open -> closed
resolution: fixed
messages: + msg145359

stage: resolved
2011-10-11 18:44:28python-devsetnosy: + python-dev
messages: + msg145357
2011-10-09 21:21:56pitrousetmessages: + msg145279
2011-10-09 11:23:03lemburgsetnosy: + lemburg
messages: + msg145251
2011-10-08 20:55:11pitrousetfiles: + memchr2.patch

messages: + msg145189
2011-10-08 20:39:45pitrousetmessages: + msg145187
2011-10-08 20:29:12pitrousetfiles: + memchr.patch
keywords: + patch
2011-10-08 20:29:03pitroucreate