classification
Title: Faster ascii decoding
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.3
process
Status: closed Resolution: duplicate
Dependencies: Superseder: Amazingly faster UTF-8 decoding
View: 14738
Assigned To: Nosy List: jcea, pitrou, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2012-03-26 20:56 by serhiy.storchaka, last changed 2012-05-11 21:58 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
decode_ascii.patch serhiy.storchaka, 2012-03-26 20:56 review
decode_ascii_2.patch serhiy.storchaka, 2012-03-26 22:53 Simplified patch only for sizeof(long) <= sizeof(void *) review
decode_ascii_bench.c serhiy.storchaka, 2012-03-27 21:17
decode_ascii_compare.ods serhiy.storchaka, 2012-03-27 21:36
Messages (16)
msg156870 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-03-26 20:56
The proposed patch accelerates ascii decoding in a particular case, when the alignment of the input data coincides with the alignment of data in PyASCIIObject. This is a common case on 32-bit platforms. I did not check whether the patch have any effect on 64-bit platforms.

Without patch:
$ ./python -m timeit -n 1000 -s 'enc = "ascii"; import codecs; d = codecs.getdecoder(enc); x = ("\u0020" * 1000000).encode(enc)' 'd(x)'
1000 loops, best of 3: 1.93 msec per loop
$ ./python -m timeit -n 10000 -s 'enc = "ascii"; import codecs; d = codecs.getdecoder(enc); x = ("\u0020" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 59.4 usec per loop

With patch:
$ ./python -m timeit -n 1000 -s 'enc = "ascii"; import codecs; d = codecs.getdecoder(enc); x = ("\u0020" * 1000000).encode(enc)' 'd(x)'
1000 loops, best of 3: 1.46 msec per loop
$ ./python -m timeit -n 10000 -s 'enc = "ascii"; import codecs; d = codecs.getdecoder(enc); x = ("\u0020" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 35.6 usec per loop
msg156872 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-03-26 22:03
I would like to have someone tested it on a 64-bit platform. May be worth a copy smaller blocks, not sizeof(long) but sizeof(void *)? This guaranteed the alignment of the destination (and very likely the source).
msg156873 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-03-26 22:09
Results on a 64-bit Linux box, Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz with 12 GB of RAM:

Unpatched: 10000 loops, best of 3: 150 usec per loop
Patched:   10000 loops, best of 3: 80.2 usec per loop
msg156874 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-03-26 22:19
+    if (((size_t) p & LONG_PTR_MASK) == ((size_t) q & LONG_PTR_MASK)) {

This test looks. I think that it should be replaced by:

    if (!((size_t) q & LONG_PTR_MASK)) {
msg156876 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-03-26 22:47
> +    if (((size_t) p & LONG_PTR_MASK) == ((size_t) q & LONG_PTR_MASK)) {
> This test looks. I think that it should be replaced by:
>     if (!((size_t) q & LONG_PTR_MASK)) {

"if (!((size_t) p & LONG_PTR_MASK)) {" if sizeof(long) <= sizeof(void *). And rewrite loop.

Now I got the idea that it is now difficult to find a platform for which sizeof(long) > sizeof(void *).
msg156880 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-03-26 23:16
+#if SIZEOF_LONG <= SIZEOF_VOID_P
+    if (!((size_t) p & LONG_PTR_MASK)) {

I wrote "q", not "p". You have to check p and q alignement to be able
to dereference p and q pointers.

sizeof(long) <= sizeof(void*) is always true.
msg156896 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-03-27 06:32
> +#if SIZEOF_LONG <= SIZEOF_VOID_P
> +    if (!((size_t) p & LONG_PTR_MASK)) {
> 
> I wrote "q", not "p". You have to check p and q alignement to be able
> to dereference p and q pointers.

Initial q (destination) is always pointer-aligned, because PyASCIIObject is 
pointer-aligned. If PyASCIIObject not aligned, then on this platform alignment 
does not matter. But initial p (source) can be non-aligned. If the initial p 
and q are not equally aligned, then they will not be able to be aligned in the 
future, and the whole loop shall be omitted.

> sizeof(long) <= sizeof(void*) is always true.

Some memory models in I8086 processor had a 4-byte long and a 2-byte pointer. 
It is already in the past, but some modern processors can be 8-byte long and a 
4-byte pointer. I do not know, whether there are such. If you think this 
condition is unnecessary, we can remove it.
msg156899 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-03-27 08:46
>> +#if SIZEOF_LONG <= SIZEOF_VOID_P
>> +    if (!((size_t) p & LONG_PTR_MASK)) {
>>
>> I wrote "q", not "p". You have to check p and q alignement to be able
>> to dereference p and q pointers.
>
> Initial q (destination) is always pointer-aligned, because PyASCIIObject is
> pointer-aligned. If PyASCIIObject not aligned, then on this platform alignment
> does not matter.

q is not the address of the Unicode string, but the address of the
data following the Unicode structure in memory. Strings created by
PyUnicode_New() are composed on one unique memory block: {structure,
data}. It's safer to check that q is aligned to sizeof(long). In
Python 3.2, it was different because the string data was always a
separated memory block (as "not ready" strings in Python 3.3).

We may change PyASCIIObject later to pack fields (to reduce memory consumption).
msg156901 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-03-27 10:34
> q is not the address of the Unicode string, but the address of the
> data following the Unicode structure in memory. Strings created by
> PyUnicode_New() are composed on one unique memory block: {structure,
> data}.

I know all that.

#define _PyUnicode_COMPACT_DATA(op)                     \
    (PyUnicode_IS_ASCII(op) ?                   \
     ((void*)((PyASCIIObject*)(op) + 1)) :              \
     ((void*)((PyCompactUnicodeObject*)(op) + 1)))

q is ((void*)((PyASCIIObject*)(op) + 1)). (PyASCIIObject*)(op) + 1 is pointer to PyASCIIObject and has same alignment as PyASCIIObject. PyASCIIObject is aligned to sizeof(void *) 
because it starts with void * field. Consequently, q is aligned to sizeof(void *). It does not depend on the number and the size of the fields in PyASCIIObject, except for the 
first one.

Of course, if _PyUnicode_COMPACT_DATA definition is changed, it will cease to be true. Then apply my first patch, which may be a bit less effective for short strings 
(performance for short strings is bad measureable through Python). However, for short strings, we can put a size limit:

if (size >= 2 * SIZEOF_LONG && ((size_t) p & LONG_PTR_MASK) == ((size_t) q & LONG_PTR_MASK)) {
msg156914 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-03-27 12:03
New tests. I'm not conviced by the patch: it slows down the decoder for "short" strings. I don't understand which kind of ASCII encoded strings (specific length or content?) are optimized by the patch.

Unpatched:

$ ./python -m timeit -n 50000 -r 100 -s 'data=open("README", "r").read().encode("ascii")' 'data.decode("ASCII")'
50000 loops, best of 100: 1.41 usec per loop

$ ./python -m timeit -n 1000 -s 'import codecs; d = codecs.getdecoder("ascii"); x = bytes(range(128))*10' 'd(x)'
1000 loops, best of 3: 0.564 usec per loop

$ ./python -m timeit -n 1000 -s 'import codecs; d = codecs.getdecoder("ascii"); x = bytes(range(128))*1000' 'd(x)'
1000 loops, best of 3: 24.4 usec per loop

$ ./python -m timeit -n 10 -s 'import codecs; d = codecs.getdecoder("ascii"); x = bytes(range(128))*100000' 'd(x)'
10 loops, best of 3: 10.9 msec per loop

$ ./python -m timeit -n 1000 -s 'enc = "ascii"; import codecs; d = codecs.getdecoder(enc); x = ("\u0020" * 1000000).encode(enc)' 'd(x)'
1000 loops, best of 3: 722 usec per loop

Patched:

$ ./python -m timeit -n 50000 -r 100 -s 'data=open("README", "r").read().encode("ascii")' 'data.decode("ASCII")'
50000 loops, best of 100: 1.74 usec per loop

$ ./python -m timeit -n 1000 -s 'import codecs; d = codecs.getdecoder("ascii"); x = bytes(range(128))*10' 'd(x)'
1000 loops, best of 3: 0.597 usec per loop

$ ./python -m timeit -n 1000 -s 'import codecs; d = codecs.getdecoder("ascii"); x = bytes(range(128))*1000' 'd(x)'
1000 loops, best of 3: 27.3 usec per loop

$ ./python -m timeit -n 10 -s 'import codecs; d = codecs.getdecoder("ascii"); x = bytes(range(128))*100000' 'd(x)'
10 loops, best of 3: 8.32 msec per loop

$ ./python -m timeit -n 1000 -s 'enc = "ascii"; import codecs; d = codecs.getdecoder(enc); x = ("\u0020" * 1000000).encode(enc)' 'd(x)'
1000 loops, best of 3: 479 usec per loop
msg156922 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-03-27 13:42
> New tests. I'm not conviced by the patch: it slows down the decoder for
> "short" strings. I don't understand which kind of ASCII encoded strings
> (specific length or content?) are optimized by the patch.

May be you forgot the -r? Add -r 100 or -r 1000 and run tests a few times to evaluate the dispersion.

I get the acceleration in all cases. For `bytes(range(128))*10` it barely exceeds variation (second patch is slightly faster than first one), for `bytes(range(128))*1000` it is 1.7x 
faster. For more short strings (under "short" I knew the length of the order of 10) measurement is thus quite impossible.

This may also depend on the processor and compiler. I have AMD Athlon 64 X2 4600+ (2-core, 2.4GHz, 512 KB cache) and use gcc 4.4.3 on 32-bit Linux.
msg156923 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-03-27 13:57
> This may also depend on the processor and compiler. I have AMD Athlon
> 64 X2 4600+ (2-core, 2.4GHz, 512 KB cache) and use gcc 4.4.3 on 32-bit 
> Linux.

Then by choosing a string length that exceeds the L2 cache size, you may have found an ideal case for your optimization. Basically you're doing the error checking and the memcpy in one single pass.

Honestly I'm not sure that's worth the hassle. ASCII-decoding is already very fast for shorter strings.

(no need to pass "-n" or "-r" to timeit, it will figure out adequate numbers by itself)
msg156956 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-03-27 21:17
Python script too rough tool to measure decoding performance on short strings. To do this I used C.

The benchmark scheme is as follows. Taken a big enough chunk of memory to reduce effect of processor cache. This area is splitted into many pieces with the same offset over long aligned block. Then measured a time of decoding all pieces of a certain size with a certain offset. Calculated an average time (ms) and decoding speed (MB/s).

Run benchmark:
gcc -Wall -O3 -I Include/ -I . -Xlinker -export-dynamic decode_ascii_bench.c libpython3.3m.a -lpthread -ldl -lutil -lm -lrt -o decode_ascii_bench && ./decode_ascii_bench
msg156957 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-03-27 21:36
As you can see, the unpatched code does not depend on the alignment. With patches aligned data (which constitute the vast majority, if not all) decoded much faster and non-aligned data decoded sometimes slightly slower. Time of decoding 2-10-bytes practically does not depend on the string length, most of the time (0.1 ms) occupy the overhead of function calls and objects creation and destruction. But even in this case, the patches show a steady increase in performance. When the overhead costs are reduced advantage becomes stronger. For short strings the second patch is better the first patch, as expected.
msg160445 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-11 19:31
Since the patch commited as part of new UTF-8 decoder, this issue can be closed (issue14738).
msg160455 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-11 21:58
Okay, thank you!
History
Date User Action Args
2012-05-11 21:58:22pitrousetdependencies: - Amazingly faster UTF-8 decoding
superseder: Amazingly faster UTF-8 decoding
2012-05-11 21:58:14pitrousetstatus: open -> closed
messages: + msg160455

dependencies: + Amazingly faster UTF-8 decoding
resolution: duplicate
stage: resolved
2012-05-11 19:31:57serhiy.storchakasetmessages: + msg160445
2012-04-01 12:14:24jceasetnosy: + jcea
2012-03-27 21:36:24serhiy.storchakasetfiles: + decode_ascii_compare.ods

messages: + msg156957
2012-03-27 21:17:42serhiy.storchakasetfiles: + decode_ascii_bench.c

messages: + msg156956
2012-03-27 13:57:50pitrousetmessages: + msg156923
2012-03-27 13:42:55serhiy.storchakasetmessages: + msg156922
2012-03-27 12:03:34vstinnersetmessages: + msg156914
2012-03-27 10:34:21serhiy.storchakasetmessages: + msg156901
2012-03-27 08:46:50vstinnersetmessages: + msg156899
2012-03-27 06:32:29serhiy.storchakasetmessages: + msg156896
2012-03-26 23:16:05vstinnersetmessages: + msg156880
2012-03-26 22:53:27serhiy.storchakasetfiles: + decode_ascii_2.patch
2012-03-26 22:47:14serhiy.storchakasetmessages: + msg156876
2012-03-26 22:19:52vstinnersetmessages: + msg156874
2012-03-26 22:09:35vstinnersetnosy: + vstinner
messages: + msg156873
2012-03-26 22:03:52serhiy.storchakasetmessages: + msg156872
2012-03-26 21:12:27vstinnersetnosy: + pitrou
2012-03-26 20:56:17serhiy.storchakacreate