classification
Title: str==str: compare the first character before calling memcmp()
Type: Stage:
Components: Versions: Python 3.4
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: christian.heimes, eric.snow, lemburg, mmokrejs, pitrou, rhettinger, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2013-04-03 21:29 by vstinner, last changed 2015-03-18 11:06 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
unicode_eq.patch vstinner, 2013-04-03 21:29 review
benchmark vstinner, 2013-04-03 21:33
bench_unicode_eq.py vstinner, 2013-04-03 21:58
benchmark2 vstinner, 2013-04-03 22:10
dont_compare_first_last.patch vstinner, 2013-04-09 22:13 review
Messages (23)
msg185956 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-03 21:29
In Python 3.4, str==str is implemented by calling memcmp().

unicode_eq() function, used by dict and set types, checks the first byte before calling memcmp(). bytes==bytes uses the same check.

Py_UNICODE_MATCH macro checks the first *and* last character before calling memcmp() since this commit:
---
changeset:   38242:0de9a789de39
branch:      legacy-trunk
user:        Fredrik Lundh <fredrik@pythonware.com>
date:        Tue May 23 10:10:57 2006 +0000
files:       Include/unicodeobject.h
description:
needforspeed: check first *and* last character before doing a full memcmp
---

Attached patch changes str==str to check the first and last character before calling memcmp(). It might reduce the overhead of a C function call, but it is much faster when comparing two different strings of the same length with a common prefix (but a different suffix).

The patch merges also unicode_compare_eq() and unicode_eq() to use the same code for str, dict and set.

We may use the same optimization on byte strings.

See also #16321.
msg185957 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-03 21:33
According to my benchmark, performances are almost the same with the patch. The major difference is on comparing two strings longer than 10 characters, of the same length, with a common prefix but a different suffix. See attached benchmark for the result.
msg185965 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-03 21:58
Attach the benchmark script.
msg185970 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-03 22:10
benchmark2: Results on a slower computer. Comparing equal strings is much faster with the patch. Example:

equal, 'A', 1000000          |  945 us (*) | 1.25 ms (+32%)

I don't understand why the patch makes the comparaison much slower, since most time is supposed to be spend in memcmp()?

Is it because I starts at the second character to compare strings, instead of the first character? Memory alignment issue?
msg186007 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-04-04 06:21
> I don't understand why the patch makes the comparaison much slower,
> since most time is supposed to be spend in memcmp()?

Because reading the last character evicts useful data from the CPU cache, just before memcmp() reads it again from memory?

In other words, I'm not convinced this is a useful heuristic.
msg186012 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-04 08:33
> In other words, I'm not convinced this is a useful heuristic.

Me neither, but we should use the same optimization strategy for all
functions. If we don't compare first and/or last character for
str==str, we should do the same for bytes==bytes and Py_UNICODE_MATCH.

str==str performances depends on the compiler and the libc. So
performances may be very different on Windows, I will try to run the
benchmark on Windows.

GCC has also a known performance issue on memcmp. Its builtin memcmp
implementation is slower than glibc >= 2.10, especiall glibc >= 2.13.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
"GCC can't beat glibc if function call overhead is low."

2013/4/4 Antoine Pitrou <report@bugs.python.org>:
>
> Antoine Pitrou added the comment:
>
>> I don't understand why the patch makes the comparaison much slower,
>> since most time is supposed to be spend in memcmp()?
>
> Because reading the last character evicts useful data from the CPU cache, just before memcmp() reads it again from memory?
>
> In other words, I'm not convinced this is a useful heuristic.
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue17628>
> _______________________________________
msg186016 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-04-04 09:09
On 04.04.2013 10:33, STINNER Victor wrote:
>>> I don't understand why the patch makes the comparaison much slower,
>>> since most time is supposed to be spend in memcmp()?
>>
>> Because reading the last character evicts useful data from the CPU cache, just before memcmp() reads it again from memory?
>>
>> In other words, I'm not convinced this is a useful heuristic.

Same here. The heuristic may work for short strings that easily fit
into the CPU cache, but as soon as you use it on longer strings,
this will result in much slower comparisons.

Whether this results in a speedup or not also depends a lot
on the domain of where you need to run comparisons, e.g. if you have
run the heuristic on Python's special method names (such as "__init__")
it won't give you any benefit. OTOH, it's easy to construct strings
that benefit a lot from it :-)

Something that typically works well in practice is to inline
the comparison of the first few characters and then call memcmp()
on the remaining ones. This avoids cache corruption and safes
a few cycles setup costs for memcmp() for short strings.
msg186017 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-04 09:21
By the way, my initial concern was to merge unicode_compare_eq() and
unicode_eq() functions.

unicode_compare_eq() only uses memcmp(), whereas unicode_eq() checks
if the first byte is different before calling memcmp(). So the
question is to decide which one is faster ;-)
msg186019 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-04-04 09:30
On 04.04.2013 11:21, STINNER Victor wrote:
> 
> STINNER Victor added the comment:
> 
> By the way, my initial concern was to merge unicode_compare_eq() and
> unicode_eq() functions.
> 
> unicode_compare_eq() only uses memcmp(), whereas unicode_eq() checks
> if the first byte is different before calling memcmp(). So the
> question is to decide which one is faster ;-)

The second one was faster a couple of years ago. Things may have
changed since then (better compilers, CPUs, etc.). Perhaps you
could run a benchmark with increasing sizes of strings, one set
with mismatches in the last character, the other with matching
strings.
msg186044 - (view) Author: Eric Snow (eric.snow) * (Python committer) Date: 2013-04-04 17:00
> Marc-Andre Lemburg added the comment:
> Same here. The heuristic may work for short strings that easily fit
> into the CPU cache, but as soon as you use it on longer strings,
> this will result in much slower comparisons.

When testing both, would it help to test the end of the string before the beginning?  I'd expect that be more likely to leave the beginning in the cache for any subsequent memcmp() call.
msg186045 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-04-04 17:30
On 04.04.2013 19:00, Eric Snow wrote:
> 
> Eric Snow added the comment:
> 
>> Marc-Andre Lemburg added the comment:
>> Same here. The heuristic may work for short strings that easily fit
>> into the CPU cache, but as soon as you use it on longer strings,
>> this will result in much slower comparisons.
> 
> When testing both, would it help to test the end of the string before the beginning?  I'd expect that be more likely to leave the beginning in the cache for any subsequent memcmp() call.

Again: this depends a lot on what strings you are dealing with. If
you are comparing strings that only vary in the first few characters,
testing the last character first would not be ideal :-)

Given that CPUs are optimized to read ahead in memory, it's always
better to avoid jumping around too much when accessing memory.

http://en.wikipedia.org/wiki/CPU_cache
http://en.wikipedia.org/wiki/Locality_of_reference
http://lwn.net/Articles/252125/

Ideally, you want to stay within a cache line, typically 64 bytes.
msg186231 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-04-07 17:31
Note that unicode_eq() always called after identity check and hash check. I.e. identity check in Victor's patch is redundant and unicode_eq() called only for strings which have the same hash. The probability to have the same first byte and be equal is a great.

unicode_compare_eq() and unicode_eq() are designed for different purposes. They cannot be just merged.

As the optimization for unicode_eq(), I would have suggested a checking of first machine words instead of first bytes, but this trick is too dirty.
msg186458 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-09 22:13
Because most people agree that checking first and/or last byte/character is not a good idea (may be slower), here is a new patch removing code checking first/last byte or character in bytes_richcompare() and unicode_eq().

It removes the usage of the "register" keyword, I read that modern compilers generate worse code when this keyword is used. Register allocators of modern compilers known better than you which variables should use a register.
msg186464 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-04-09 22:24
> Because most people agree that checking first and/or last
> byte/character is not a good idea (may be slower), here is a new patch
> removing code checking first/last byte or character in
> bytes_richcompare() and unicode_eq().

You misunderstood. Checking the first byte is ok, it's checking the last
byte which can misfire.

> It removes the usage of the "register" keyword, I read that modern
> compilers generate worse code when this keyword is used. Register
> allocators of modern compilers known better than you which variables
> should use a register.

I think you should open a separate issue to remove all instances of the
"register" keyword in the code base. It has become useless, any modern
compiler will allocate registers by itself, happily ignoring any
"register" hint.
msg186733 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-04-13 15:30
Some crazy ideas.

Try something like this:

    #define BLOCK unsigned long
    if (size >= sizeof(BLOCK)) {
        if (*(BLOCK*)data1 != *(BLOCK*)data2)
            return 0;
        return (memcmp((unsigned char*)data1 + sizeof(BLOCK),
                       (unsigned char*)data2 + sizeof(BLOCK), size) == 0);
    }
    if (*(unsigned char*)data1 != *(unsigned char*)data2)
        return 0;
    return (memcmp(data1, data2, size) == 0);

Or may be unroll memcmp for small size:

    switch (size) {
#if SIZEOF_LONG == 8
        case 7:
            ...
#endif
        case 3:
            ...
        case 2:
            if (((unsigned char*)data1)[1] != ((unsigned char*)data2)[1])
                return 0;
        case 1:
            if (((unsigned char*)data1)[0] != ((unsigned char*)data2)[0])
                return 0;
        case 0:
            return 1;
        default:
            // case for size >= sizeof(BLOCK)
    }
msg186779 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-13 18:21
You must check that data is aligned.

Did you run a benchmark?
msg195234 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-08-15 05:33
The dont-compare-first-last patch looks about right.

The "if (len == 0) return 1;" shortcut perhaps should be taken out.  It makes the common case pay (if only slightly) for the rare case (which of course, never gets predicted).  This whole code block gets inlined in the very tight inner loops of the set/dict lookkey functions -- it should be as thin as possible.

One other thing to take a look at is the "PyUnicode_GET_LENGTH(a) * PyUnicode_KIND(a)" expression.  The disassembly shows an imulq instruction rather than the usual scaled LEA computation.  I don't know if this can be sped-up by a predictable branch for the common case.
msg195236 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-08-15 07:24
Perhaps it worth manually inline unicode_eq() in these tight inner loops. Then we can move "PyUnicode_GET_LENGTH(a) * PyUnicode_KIND(a)" out of the loop.
msg195240 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-08-15 09:22
In issue #18719, Raymond modified Python 2.7, but he didn't touch the following macro:

#define Py_UNICODE_MATCH(string, offset, substring) \
    ((*((string)->str + (offset)) == *((substring)->str)) && \
    ((*((string)->str + (offset) + (substring)->length-1) == *((substring)->str + (substring)->length-1))) && \
     !memcmp((string)->str + (offset), (substring)->str, (substring)->length*sizeof(Py_UNICODE)))

It was said that looking for the last character before calling memcmp() is inefficient for the CPU cache. This macro should also be modified.
msg195241 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-08-15 09:28
It's also bad for memory read performance if the string is rather long. The memory controller performs best when code reads memory sequential. The talk http://www.youtube.com/watch?v=MC1EKLQ2Wmg about mythbusting modern hardware sums it up real nice.
msg195419 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-08-16 21:08
FWIW, there are two distinct issues.  As everyone has noted here, accessing memory in non-sequential order is a performance killer.

The other issue (the one I was working on) is that early-out first-char or last-char tests are a waste (almost never executed) if we already know that the string hashes match (i.e. in the string equality calls from the  dict and set lookup routines).  If the hashes match, the odds are 2**64 to 1 if favor of the strings being equal.
msg196690 - (view) Author: Martin Mokrejs (mmokrejs) Date: 2013-09-01 00:04
Regarding benchmarking and code performance inspection, maybe you want to try on your linux box:

perf top
perf stat /usr/bin/python mytest.py

http://perf.wiki.kernel.org/
msg238417 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-03-18 11:06
The benefit of any change is unclear. I'm not more interested to work on such optimization, so I just close the issue.
History
Date User Action Args
2015-03-18 11:06:35vstinnersetstatus: open -> closed
resolution: out of date
messages: + msg238417
2013-09-01 00:04:15mmokrejssetnosy: + mmokrejs
messages: + msg196690
2013-08-16 21:08:42rhettingersetmessages: + msg195419
2013-08-15 09:28:36christian.heimessetmessages: + msg195241
2013-08-15 09:22:11vstinnersetmessages: + msg195240
2013-08-15 07:24:22serhiy.storchakasetmessages: + msg195236
2013-08-15 05:33:49rhettingersetnosy: + rhettinger
messages: + msg195234
2013-07-05 23:25:52christian.heimessetnosy: + christian.heimes
2013-06-05 00:35:47vstinnersettitle: str==str: compare the first and last character before calling memcmp() -> str==str: compare the first character before calling memcmp()
2013-04-13 18:21:31vstinnersetmessages: + msg186779
2013-04-13 15:30:12serhiy.storchakasetmessages: + msg186733
2013-04-09 22:24:10pitrousetmessages: + msg186464
2013-04-09 22:13:49vstinnersetfiles: + dont_compare_first_last.patch

messages: + msg186458
2013-04-07 17:31:57serhiy.storchakasetmessages: + msg186231
2013-04-04 17:30:09lemburgsetmessages: + msg186045
2013-04-04 17:00:13eric.snowsetnosy: + eric.snow
messages: + msg186044
2013-04-04 09:30:40lemburgsetmessages: + msg186019
2013-04-04 09:21:43vstinnersetmessages: + msg186017
2013-04-04 09:09:36lemburgsetnosy: + lemburg
messages: + msg186016
2013-04-04 08:33:11vstinnersetmessages: + msg186012
2013-04-04 06:21:07pitrousetmessages: + msg186007
2013-04-03 22:10:33vstinnersetfiles: + benchmark2

messages: + msg185970
2013-04-03 21:58:19vstinnersetfiles: + bench_unicode_eq.py

messages: + msg185965
2013-04-03 21:33:03vstinnersetfiles: + benchmark

messages: + msg185957
2013-04-03 21:29:12vstinnercreate