Faster ascii decoding #58627
Comments
The proposed patch accelerates ASCII decoding in a particular case: when the alignment of the input data coincides with the alignment of the data in PyASCIIObject. This is a common case on 32-bit platforms. I did not check whether the patch has any effect on 64-bit platforms.

Without patch:
With patch:
I would like someone to test it on a 64-bit platform. It may be worth copying smaller blocks, not sizeof(long) but sizeof(void *); this would guarantee the alignment of the destination (and very likely of the source).
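For context, the trick under discussion is word-at-a-time decoding: test the high bits of sizeof(long) input bytes with a single mask and copy them with a single store. Below is a minimal self-contained sketch of the idea, not the actual patch; ascii_decode_sketch and ASCII_CHAR_MASK are illustrative names.

```c
#include <stddef.h>

#define LONG_PTR_MASK   (sizeof(long) - 1)
/* 0x80 in every byte of an unsigned long (0x8080...80):
   word & ASCII_CHAR_MASK == 0 means every byte is ASCII. */
#define ASCII_CHAR_MASK ((unsigned long)-1 / 0xFF * 0x80)

/* Copy ASCII bytes from p to q; return how many bytes were decoded
   before the first non-ASCII byte (== size if all input is ASCII). */
static size_t
ascii_decode_sketch(const char *p, char *q, size_t size)
{
    const char *start = p, *end = p + size;

    /* Fast path only when p and q are misaligned by the same amount,
       which is the case the patch targets. */
    if (((size_t)p & LONG_PTR_MASK) == ((size_t)q & LONG_PTR_MASK)) {
        /* Advance byte by byte until p (and therefore q) is aligned. */
        while (p < end && ((size_t)p & LONG_PTR_MASK)) {
            if ((unsigned char)*p & 0x80)
                return p - start;
            *q++ = *p++;
        }
        /* Check and copy one long word per iteration. */
        while (p + sizeof(long) <= end) {
            unsigned long value = *(const unsigned long *)p;
            if (value & ASCII_CHAR_MASK)
                break;  /* some byte has its high bit set */
            *(unsigned long *)q = value;
            p += sizeof(long);
            q += sizeof(long);
        }
    }
    /* Trailing (or never-aligned) bytes, one at a time. */
    while (p < end) {
        if ((unsigned char)*p & 0x80)
            return p - start;
        *q++ = *p++;
    }
    return size;
}
```

The real decoder also has to allocate the string object and handle the error path; the sketch shows only the scanning and copying core.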
Results on a 64-bit Linux box, Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz with 12 GB of RAM:

Unpatched: 10000 loops, best of 3: 150 usec per loop
```
+ if (((size_t) p & LONG_PTR_MASK) == ((size_t) q & LONG_PTR_MASK)) {
```

This test looks strange. I think that it should be replaced by:

```
if (!((size_t) q & LONG_PTR_MASK)) {
```
"if (!((size_t) p & LONG_PTR_MASK)) {" if sizeof(long) <= sizeof(void *). And rewrite loop. Now I got the idea that it is now difficult to find a platform for which sizeof(long) > sizeof(void *). |
```
+#if SIZEOF_LONG <= SIZEOF_VOID_P
```

I wrote "q", not "p". You have to check the alignment of both p and q to be able to copy long words. And I am not sure that sizeof(long) <= sizeof(void *) is always true.
The initial q (the destination) is always pointer-aligned, because PyASCIIObject is. Some memory models on the I8086 processor had a 4-byte long and a 2-byte pointer.
q is not the address of the Unicode string, but the address of the data following the PyASCIIObject structure. We may change PyASCIIObject later to pack its fields (to reduce memory consumption).
I know all that.

```c
#define _PyUnicode_COMPACT_DATA(op) \
    (PyUnicode_IS_ASCII(op) ? \
     ((void*)((PyASCIIObject*)(op) + 1)) : \
     ((void*)((PyCompactUnicodeObject*)(op) + 1)))
```

q is ((void*)((PyASCIIObject*)(op) + 1)). (PyASCIIObject*)(op) + 1 points just past a PyASCIIObject and therefore has the same alignment as PyASCIIObject, and PyASCIIObject is aligned to sizeof(void *). Of course, if the _PyUnicode_COMPACT_DATA definition is changed, this will cease to be true. Then apply my first patch, which may be a bit less effective for short strings:

```c
if (size >= 2 * SIZEOF_LONG &&
    ((size_t) p & LONG_PTR_MASK) == ((size_t) q & LONG_PTR_MASK)) {
```
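The alignment claim can be checked outside of CPython. Here is a small standalone sketch with a stand-in struct (FakeASCIIObject is hypothetical; the real PyASCIIObject has more fields, but the reasoning carries over):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for PyASCIIObject: a struct with a pointer member has
   alignment of at least sizeof(void *) on common platforms. */
typedef struct {
    size_t length;
    void *wstr;
} FakeASCIIObject;

int main(void)
{
    /* malloc returns memory suitably aligned for any object type. */
    FakeASCIIObject *op = malloc(sizeof(FakeASCIIObject) + 64);
    if (!op)
        return 1;

    /* Same shape as ((void*)((PyASCIIObject*)(op) + 1)): op + 1
       advances by sizeof(FakeASCIIObject), which is a multiple of the
       struct's alignment, so q inherits that alignment. */
    void *q = (void *)(op + 1);

    assert(((size_t)q & (sizeof(void *) - 1)) == 0);
    printf("q = %p is pointer-aligned\n", q);
    free(op);
    return 0;
}
```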
New tests. I'm not convinced by the patch: it slows down the decoder for "short" strings. I don't understand which kind of ASCII-encoded strings (specific length or content?) are optimized by the patch.

Unpatched:

```
$ ./python -m timeit -n 50000 -r 100 -s 'data=open("README", "r").read().encode("ascii")' 'data.decode("ASCII")'
50000 loops, best of 100: 1.41 usec per loop
$ ./python -m timeit -n 1000 -s 'import codecs; d = codecs.getdecoder("ascii"); x = bytes(range(128))*10' 'd(x)'
1000 loops, best of 3: 0.564 usec per loop
$ ./python -m timeit -n 1000 -s 'import codecs; d = codecs.getdecoder("ascii"); x = bytes(range(128))*1000' 'd(x)'
1000 loops, best of 3: 24.4 usec per loop
$ ./python -m timeit -n 10 -s 'import codecs; d = codecs.getdecoder("ascii"); x = bytes(range(128))*100000' 'd(x)'
10 loops, best of 3: 10.9 msec per loop
$ ./python -m timeit -n 1000 -s 'enc = "ascii"; import codecs; d = codecs.getdecoder(enc); x = ("\u0020" * 1000000).encode(enc)' 'd(x)'
1000 loops, best of 3: 722 usec per loop
```

Patched:

```
$ ./python -m timeit -n 50000 -r 100 -s 'data=open("README", "r").read().encode("ascii")' 'data.decode("ASCII")'
50000 loops, best of 100: 1.74 usec per loop
$ ./python -m timeit -n 1000 -s 'import codecs; d = codecs.getdecoder("ascii"); x = bytes(range(128))*10' 'd(x)'
1000 loops, best of 3: 0.597 usec per loop
$ ./python -m timeit -n 1000 -s 'import codecs; d = codecs.getdecoder("ascii"); x = bytes(range(128))*1000' 'd(x)'
1000 loops, best of 3: 27.3 usec per loop
$ ./python -m timeit -n 10 -s 'import codecs; d = codecs.getdecoder("ascii"); x = bytes(range(128))*100000' 'd(x)'
10 loops, best of 3: 8.32 msec per loop
$ ./python -m timeit -n 1000 -s 'enc = "ascii"; import codecs; d = codecs.getdecoder(enc); x = ("\u0020" * 1000000).encode(enc)' 'd(x)'
1000 loops, best of 3: 479 usec per loop
```
Maybe you forgot the -r? Add -r 100 or -r 1000 and run the tests a few times to evaluate the dispersion; I get the acceleration in all cases. This may also depend on the processor and compiler. I have an AMD Athlon 64 X2 4600+ (2-core, 2.4 GHz, 512 KB cache) and use gcc 4.4.3 on 32-bit Linux.
Then by choosing a string length that exceeds the L2 cache size, you may have found an ideal case for your optimization. Basically you're doing the error checking and the memcpy in a single pass. Honestly, I'm not sure that's worth the hassle: ASCII decoding is already very fast for shorter strings. (No need to pass "-n" or "-r" to timeit; it will figure out adequate numbers by itself.)
A Python script is too rough a tool to measure decoding performance on short strings, so I used C. The benchmark scheme is as follows: take a big enough chunk of memory to reduce the effect of the processor cache, split this area into many pieces at the same offset within a long-aligned block, then measure the time to decode all pieces of a certain size at a certain offset, and calculate the average time (ms) and decoding speed (MB/s). Run the benchmark:
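In outline, such a harness might look like the following sketch; this is a hedged reconstruction rather than the benchmark actually used, with a byte-wise decode_ascii standing in for the decoder under test:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define ARENA_SIZE (64 * 1024 * 1024)  /* large enough to defeat the cache */
#define BLOCK_SIZE 64                  /* long-aligned stride between pieces */

/* Byte-wise reference decoder; swap in the patched word-at-a-time
   version to compare. */
static size_t decode_ascii(const char *p, char *q, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        if ((unsigned char)p[i] & 0x80)
            return i;
        q[i] = p[i];
    }
    return size;
}

int main(void)
{
    char *arena = malloc(ARENA_SIZE);
    char out[BLOCK_SIZE];
    size_t piece = 10;  /* bytes decoded per call */

    if (!arena)
        return 1;
    memset(arena, 'a', ARENA_SIZE);

    for (size_t offset = 0; offset < sizeof(long); offset++) {
        clock_t t0 = clock();
        /* Decode one piece per long-aligned block, always at the same
           offset inside the block. */
        for (size_t pos = 0; pos + BLOCK_SIZE <= ARENA_SIZE; pos += BLOCK_SIZE)
            decode_ascii(arena + pos + offset, out, piece);
        double ms = (clock() - t0) * 1000.0 / CLOCKS_PER_SEC;
        /* bytes / (ms * 1000) == megabytes per second */
        double mbps = (double)(ARENA_SIZE / BLOCK_SIZE) * piece / (ms * 1000.0);
        printf("offset %zu: %.2f ms, %.1f MB/s\n", offset, ms, mbps);
    }
    free(arena);
    return 0;
}
```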
As you can see, the unpatched code does not depend on the alignment. With the patches, aligned data (which constitutes the vast majority of real input, if not all of it) is decoded much faster, while non-aligned data is sometimes decoded slightly slower. The time to decode 2-10 bytes practically does not depend on the string length; most of the time (0.1 ms) is taken up by the overhead of function calls and object creation and destruction. But even in this case the patches show a steady increase in performance, and when the overhead is reduced the advantage becomes stronger. For short strings the second patch is better than the first, as expected.
Since the patch was committed as part of the new UTF-8 decoder, this issue can be closed (bpo-14738).
Okay, thank you!