New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize str%tuple for the PEP 393 #58892
Comments
PyUnicode_Format() creates short temporary substrings. Attached patch tries to avoid substrings. For example, it avoids write of 1 character and repetition of 1 character like a space. PyUnicode_Format() now works in two steps:
I'm not sure that my patch is correct, nor that the change does really speed-up Python. Benchmark: ./python -m timeit \ Python 3.2: 1000000 loops, best of 3: 0.482 usec per loop Python 3.3: 1000000 loops, best of 3: 0.653 usec per loop Python 3.3 + patch: 1000000 loops, best of 3: 0.596 usec per loop |
I see sped up +10% on Intel Atom (but 3.2 still 2x fast). With non-ascii arguments speed up can be a little bit larger. |
Updated patch:
|
New changeset 42fbb4f9b540 by Victor Stinner in branch 'default': New changeset 08b54c635586 by Victor Stinner in branch 'default': |
New changeset 4b98ce6ef95e by Victor Stinner in branch 'default': New changeset a966f9311ebb by Victor Stinner in branch 'default': |
pyunicode_format_writer.patch: a new completly different approach. It's an optimistic patch: start with a short ASCII buffer, and grows slowly the buffer, and convert to UCS2 and maybe to UCS4 if needed. The UTF-8 decoder is based on the same idea. The patch adds a "unicode writer", the optimistic writer. It overallocates the buffer by 50% to limit the number of calls to PyUnicode_Resize(). It may be reused by other functions. My dummy benchmark script: $ cat ~/bench.sh
./python -m timeit \
-s 'fmt="%s:"; arg="abc"' \
'fmt % arg'
./python -m timeit \
-s 'N=200; L=3; fmt="%s"*N; args=("a"*L,)*N' \
'fmt % args'
./python -m timeit \
-s 's="x=%s, y=%u, z=%x"; args=(123, 456, 789)' \
's%args'
./python -m timeit \
-s 's="The %(k1)s is %(k2)s the %(k3)s."; args={"k1":"x","k2":"y","k3":"z",}' \
's%args' Results. Python 3.2: 10000000 loops, best of 3: 0.0916 usec per loop Python 3.3: 10000000 loops, best of 3: 0.169 usec per loop Python 3.3 optimist (compared to 3.3): 10000000 loops, best of 3: 0.123 usec per loop (-27%) Overhead of the PEP-393 (Python 3.2 => 3.3) without -> with the patch:
-- "%(name)s" syntax is still *much* slower than Python 3.2, I don't understand why. Parameters of the Unicode writer (overallocation factor and initial size) may be adjusted (later?) for better performances. |
New changeset 90b4c2d7c90d by Victor Stinner in branch 'default': |
I tried to compute the initial size from the args object. It is hard because we don't know how arguments are used. For example, an argument may not be a number formatted as decimal, but the width of an argument. Another example: "%.3s" may be used to only read the 3 first characters of a very long string. So I consider that it is *simpler and safer* to not guess anything from args, but only choose correctly the overallocation factor. I tried the following factors: 1 (no overallocation), 1.25, 1.5 and 2.0. It looks like 1.25 (pos*5/4) is the best compromise and offers the best performances. |
New changeset 830eeff4fe8f by Victor Stinner in branch 'default': |
Results on 32 bits (Intel Core i5 CPU 661 @ 3.33GHz) on Linux 3.0 with a new patch. Python 3.2: 10000000 loops, best of 3: 0.133 usec per loop Python 3.3 @ 1439e2d1f490 (before my first optimization on str%tuple): 10000000 loops, best of 3: 0.193 usec per loop Python 3.3 + patch, overallocate 50%: 10000000 loops, best of 3: 0.15 usec per loop Python 3.3 + patch, overallocate 25%: 10000000 loops, best of 3: 0.142 usec per loop I'm going to commit the new patch with an overallocation of 25%. |
New changeset f1db931b93d3 by Victor Stinner in branch 'default': |
It would be nice to have measurements under Windows. |
New changeset 0a9143d7b097 by Victor Stinner in branch 'default': |
The differences between 32-bit Linux and 32-bit Windows should not be. |
The Windows memory allocator is quite different from the glibc's, so the |
Your tests only for ascii. You should also see some of corner cases -- |
New changeset f58159e5d52f by Victor Stinner in branch 'default': |
Do you think that this optimization is relevant to 2.7? |
No, this is just an attempt to deal with the shortcomings of PEP-393. |
Results on Windows Seven 64 bits, on Intel i7 2600 @ 3.4 GHz (8 cores), Windows running in a virtual machine (kvm) with hardware virtualization. Python 3.2.2: 10000000 loops, best of 3: 0.12 usec per loop Python 3.3 @ 1439e2d1f490: 1000000 loops, best of 3: 0.265 usec per loop Python 3.3 @ cdc4e0f8135d: 10000000 loops, best of 3: 0.154 usec per loop To be honest, I'm surprised that my work speeds up Python 3.3 on Windows, because I read that realloc() on Windows is not efficient. It is maybe no more true with Windows Seven? It would be interesting if someone could run the benchmark on Windows XP or 2000. |
Very nice, thank you! |
|
My benchmark is more a *micro* benchmark on some very basic cases (short ASCII strings). But it looks like overallocating *helps*. In the following example, only two resize are needed: ./python -m timeit \ len(fmt)+100 = 500 characters are allocated for the initial buffer. Writing the 501st character enlarges the buffer to 626 characters: first resize. The output string is truncated to 600 characters: second and final resize. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: