This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Use _PyUnicodeWriter API in str.format() internals
Type: Stage:
Components: Interpreter Core Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: loewis, mark.dickinson, pitrou, python-dev, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2012-05-07 22:09 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
format_writer.patch vstinner, 2012-05-07 22:09 review
format_writer-2.patch vstinner, 2012-05-08 23:58 review
issue14744-bench-1.py serhiy.storchaka, 2012-05-09 08:03
inline_unicode_writer.patch vstinner, 2012-05-09 11:53 review
dont_overallocate.patch vstinner, 2012-05-13 02:26 review
benchmark.py vstinner, 2012-05-13 02:27
REPORT_32BIT_2.7_3.2_writer vstinner, 2012-05-23 21:10
REPORT_32BIT_3.3 vstinner, 2012-05-23 21:10
REPORT_64BIT_2.7_3.2_writer vstinner, 2012-05-23 21:10
REPORT_64BIT_3.3 vstinner, 2012-05-23 21:11
REPORT_64BIT_PATCH vstinner, 2012-05-23 21:38
faster-format.patch vstinner, 2012-05-23 21:50 review
report_windows7 vstinner, 2012-05-29 22:35
Repositories containing patches
http://hg.python.org/features/faster-format/
Messages (37)
msg160176 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-07 22:09
Since 7be716a47e9d (issue #14716), str.format() uses the "unicode_writer" API. I propose to continue the work in this direction to avoid more temporary buffers.

Python 3.3:

1000000 loops, best of 3: 0.573 usec per loop
100000 loops, best of 3: 16.4 usec per loop
1000000 loops, best of 3: 0.705 usec per loop
100000 loops, best of 3: 2.61 usec per loop

Python 3.3 + patch (compared to Python 3.3):

1000000 loops, best of 3: 0.516 usec per loop (-10%)
100000 loops, best of 3: 13.2 usec per loop (-20%)
1000000 loops, best of 3: 0.574 usec per loop (-18%)
100000 loops, best of 3: 2.59 usec per loop (-1%)

--

If this patch is accepted, it's more to go even deeper: use _PyUnicodeWriter in long_to_decimal_string() for example.

--

Benchmark Python 3 / Python 2 bytes:

python -m timeit -s 'fmt="{0}.{1}.{2}"' 'fmt.format("http", "client", "HTTPConnection")'
python -m timeit -s 'fmt="{0:s}"*100' 'fmt.format("ABCDEF")'
python -m timeit -s 'fmt=" [line {0:2d}] "' 'fmt.format(5)'
python -m timeit -s 'fmt="x={} y={} z={}"' 'fmt.format(12345, 12.345, 12.345+2j)'

Benchmark Python 2 unicode:

python -m timeit -s 'fmt=u"{0}.{1}.{2}"' 'fmt.format(u"http", u"client", u"HTTPConnection")'
python -m timeit -s 'fmt=u"{0:s}"*100' 'fmt.format(u"ABCDEF")'
python -m timeit -s 'fmt=u" [line {0:2d}] "' 'fmt.format(5)'
python -m timeit -s 'fmt=u"x={} y={} z={}"' 'fmt.format(12345, 12.345, 12.345+2j)'

Python 2.7 bytes:

1000000 loops, best of 3: 0.393 usec per loop
100000 loops, best of 3: 9.72 usec per loop
1000000 loops, best of 3: 0.337 usec per loop
1000000 loops, best of 3: 1.56 usec per loop

Python 2.7 wide:

1000000 loops, best of 3: 0.443 usec per loop
100000 loops, best of 3: 10.3 usec per loop
1000000 loops, best of 3: 0.785 usec per loop
100000 loops, best of 3: 2.48 usec per loop

Python 3.2 wide:

1000000 loops, best of 3: 0.457 usec per loop
100000 loops, best of 3: 10.5 usec per loop
1000000 loops, best of 3: 0.538 usec per loop
100000 loops, best of 3: 2.36 usec per loop
msg160177 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-07 22:19
Comments on the patch.

-PyAPI_FUNC(PyObject *) _PyComplex_FormatAdvanced(PyObject *obj,
+PyAPI_FUNC(int) _PyComplex_FormatWriter(PyObject *obj,

Even if it is a private function, I prefer to rename it because its API does change.

/* Use the inlined version in unicodeobject.c */
#define _PyUnicodeWriter_prepare _PyUnicodeWriter_prepare_inline
#define _PyUnicodeWriter_write_substr _PyUnicodeWriter_write_substr_inline
#define _PyUnicodeWriter_write_str _PyUnicodeWriter_write_str_inline
#define _PyUnicodeWriter_write_char _PyUnicodeWriter_write_char_inline

Inlining may be removed to simplify the code (but inlining does speed up the code a little bit). Or the opposite: this code should be moved to a new "unicodewriterinline.h" file which would be included by unicodeobject.c and formatter_unicode.c.
msg160186 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-08 05:32
> If this patch is accepted, it's more to go even deeper: use _PyUnicodeWriter in long_to_decimal_string() for example.

long_to_decimal_string() is already creates a string of known size. How
_PyUnicodeWriter can help here?

Issue3451 looks much more promising for int formatting. But it will take
a lot of time to carefully check this.
msg160189 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-08 08:37
_PyUnicodeWriter in long_to_decimal_string() for example.
>
> long_to_decimal_string() is already creates a string of known size. How
> _PyUnicodeWriter can help here?

"x={}".format(123) uses a temporary buffer for "123". Using
_PyUnicodeWriter even to format 123 would avoid a malloc() and a copy of
the characters.
msg160190 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2012-05-08 09:02
> Issue3451 looks much more promising for int formatting. But it will take
> a lot of time to carefully check this.

I disagree:  Issue 3451 is about *asymptotically* fast base conversion, and the changes proposed there are only going to kick in for numbers with hundreds of digits;  it's not going to affect the common case at all.
msg160194 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-08 10:38
> "x={}".format(123) uses a temporary buffer for "123".

This, apparently, is inevitable. I doubt that it is able to considerably
optimize, not worsened str(int) (which is optimal for current
algorithm). Note that the more complex formatting (with width) will
still require the temporary buffer.

Be very careful not to cause regress.

>  Using
> _PyUnicodeWriter even to format 123 would avoid a malloc() and a copy of
> the characters.

Fill the ascii buffer and then copying can be cheaper than using
_PyUnicodeWriter with general non-ascii string.
msg160236 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-08 23:58
> Fill the ascii buffer and then copying can be cheaper than using
> _PyUnicodeWriter with general non-ascii string.

Here is a new patch using _PyUnicodeWriter directly in longobject.c.

According to my benchmark (see below), formating a small number (5 decimal digits) is 17% faster with my patch version 2 compared to tip, and 38% faster compared to Python 3.3 before my optimizations on str%tuples or str.format(). Creating a temporary PyUnicode is not cheap, at least for short strings.

str%tuple and str.format() allocates len(format_string)+100 ASCII characters at the beginning, which is enough for "x={}".format(12345) for example. So only a resize is needed, and it looks like resizing is cheap.

I'm not completly satisfied of the usage of Py_LOCAL_INLINE in unicodeobject.c for _PyUnicodeWriter methods. The same "hacks" (?) should be used in formatter_unicode.c.

Shell script (bench.sh) used to benchmark:
--------
echo -n "{0}.{1}.{2}: "; ./python -m timeit -r 10 -s 'fmt="{0}.{1}.{2}"' 'fmt.format("http", "client", "HTTPConnection")'
echo -n " [line {0:2d}] : "; ./python -m timeit -r 10 -s 'fmt=" [line {0:2d}] "' 'fmt.format(5)'
echo -n "str: "; ./python -m timeit -r 10 -s 'fmt="{0}"*100' 'fmt.format("ABCDEF")'
echo -n "str conv: "; ./python -m timeit -r 10 -s 'fmt="{0:s}"*100' 'fmt.format("ABCDEF")'
echo -n "long x 3: "; ./python -m timeit -r 10 -s 'fmt="x={0} x={0} x={0}"' 'fmt.format(12345)'
echo -n "float x 3: "; ./python -m timeit -r 10 -s 'fmt="x={0} x={0} x={0}"' 'fmt.format(12.345)'
echo -n "complex x 3: "; ./python -m timeit -r 10 -s 'fmt="x={0} x={0} x={0}"' 'fmt.format(12.345+2j)'
echo -n "long, float, complex: "; ./python -m timeit -r 10 -s 'fmt="x={} y={} z={}"' 'fmt.format(12345, 12.345, 12.345+2j)'
echo -n "huge long: "; ./python -m timeit -r 10 -s 'import math; huge=math.factorial(2000); fmt="x={}"' 'fmt.format(huge)'
--------

Results:
--------
3.3:

{0}.{1}.{2}: 1000000 loops, best of 10: 0.394 usec per loop
 [line {0:2d}] : 1000000 loops, best of 10: 0.519 usec per loop
str: 100000 loops, best of 10: 7.01 usec per loop
str conv: 100000 loops, best of 10: 13.3 usec per loop
long x 3: 1000000 loops, best of 10: 0.569 usec per loop
float x 3: 1000000 loops, best of 10: 1.62 usec per loop
complex x 3: 100000 loops, best of 10: 3.34 usec per loop
long, float, complex: 100000 loops, best of 10: 2.08 usec per loop
huge long: 1000 loops, best of 10: 666 usec per loop

3.3 + format_writer.patch :

{0}.{1}.{2}: 1000000 loops, best of 10: 0.412 usec per loop (+5%)
 [line {0:2d}] : 1000000 loops, best of 10: 0.461 usec per loop (-11%)
str: 100000 loops, best of 10: 6.85 usec per loop (-2%)
str conv: 100000 loops, best of 10: 11.1 usec per loop (-17%)
long x 3: 1000000 loops, best of 10: 0.605 usec per loop (+6%)
float x 3: 1000000 loops, best of 10: 1.57 usec per loop (-3%)
complex x 3: 100000 loops, best of 10: 3.54 usec per loop (+6%)
long, float, complex: 100000 loops, best of 10: 2.19 usec per loop (+5%)
huge long: 1000 loops, best of 10: 665 usec per loop (0%)

3.3 + format_writer-2.patch :

{0}.{1}.{2}: 1000000 loops, best of 10: 0.378 usec per loop (-4%)
 [line {0:2d}] : 1000000 loops, best of 10: 0.454 usec per loop (-13%)
str: 100000 loops, best of 10: 6.18 usec per loop (-12%)
str conv: 100000 loops, best of 10: 10.9 usec per loop (-18%)
long x 3: 1000000 loops, best of 10: 0.471 usec per loop (-17%)
float x 3: 1000000 loops, best of 10: 1.37 usec per loop (-15%)
complex x 3: 100000 loops, best of 10: 3.4 usec per loop (+2%)
long, float, complex: 1000000 loops, best of 10: 1.93 usec per loop (-7%)
huge long: 1000 loops, best of 10: 665 usec per loop (0%)
--------
msg160244 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-09 05:38
> According to my benchmark (see below), formating a small number (5
> decimal digits) is 17% faster with my patch version 2 compared to tip,
> and 38% faster compared to Python 3.3 before my optimizations on str%
> tuples or str.format(). Creating a temporary PyUnicode is not cheap,
> at least for short strings.

A 17% improvement on a micro-benchmark is not much. There will probably
be no visible difference in real-world code.

Also, if creating temporary PyUnicodes is not cheap, perhaps we could
have a freelist for them?
msg160259 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-09 08:03
> Here is a new patch using _PyUnicodeWriter directly in longobject.c.

It may be worth to do it in a separate issue?

decimal digits) is 17% faster with my patch version 2 compared to tip,
and 38% faster compared to Python 3.3 before my optimizations on str%
tuples or str.format(). Creating a temporary PyUnicode is not cheap, at
least for short strings.

Here is my benchmark script (attached) and the results:

Python 3.3 (vanilla):

1.65	str(12345)
2.6	'{}'.format(12345)
2.69	'A{}'.format(12345)
3.16	'\x80{}'.format(12345)
3.23	'\u0100{}'.format(12345)
3.32	'\U00010000{}'.format(12345)
4.6	'{:-10}'.format(12345)
4.89	'A{:-10}'.format(12345)
5.53	'\x80{:-10}'.format(12345)
5.71	'\u0100{:-10}'.format(12345)
5.63	'\U00010000{:-10}'.format(12345)
4.6	'{:,}'.format(12345)
4.71	'A{:,}'.format(12345)
5.28	'\x80{:,}'.format(12345)
5.65	'\u0100{:,}'.format(12345)
5.59	'\U00010000{:,}'.format(12345)

Python 3.3 + format_writer.patch:

1.72	str(12345)
2.74	'{}'.format(12345)
2.99	'A{}'.format(12345)
3.4	'\x80{}'.format(12345)
3.52	'\u0100{}'.format(12345)
3.51	'\U00010000{}'.format(12345)
4.24	'{:-10}'.format(12345)
4.6	'A{:-10}'.format(12345)
5.16	'\x80{:-10}'.format(12345)
6.87	'\u0100{:-10}'.format(12345)
6.83	'\U00010000{:-10}'.format(12345)
4.12	'{:,}'.format(12345)
4.6	'A{:,}'.format(12345)
5.09	'\x80{:,}'.format(12345)
6.63	'\u0100{:,}'.format(12345)
6.42	'\U00010000{:,}'.format(12345)

Python 3.3 + format_writer-2.patch: 

1.91	str(12345)
2.44	'{}'.format(12345)
2.61	'A{}'.format(12345)
3.08	'\x80{}'.format(12345)
3.31	'\u0100{}'.format(12345)
3.13	'\U00010000{}'.format(12345)
4.57	'{:-10}'.format(12345)
4.96	'A{:-10}'.format(12345)
5.52	'\x80{:-10}'.format(12345)
7.01	'\u0100{:-10}'.format(12345)
7.34	'\U00010000{:-10}'.format(12345)
4.42	'{:,}'.format(12345)
4.76	'A{:,}'.format(12345)
5.16	'\x80{:,}'.format(12345)
7.2	'\u0100{:,}'.format(12345)
6.74	'\U00010000{:,}'.format(12345)

As you can see, there is a regress, and sometimes it is not less than
improvement.
msg160277 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-09 11:53
> Inlining may be removed to simplify the code

Attached inline_unicode_writer.patch does inline the code but also call only unicode_writer_prepare() once for each argument in PyUnicode_Format(). The patch removes unicode_writer_write_char() and unicode_writer_write_str() which had an overhead for the following patches (format_writer.patch, format_writer-2.patch).

> As you can see, there is a regress, and sometimes
> it is not less than improvement.

Oh yes, thanks for your benchmark. I will analyze why my patch slows down these cases and try to update my patch to be applicable after inline_unicode_writer.patch. I suppose that the _PyUnicodeWriter API has a little overhead which is seen by microbenchmarks.
msg160313 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-05-09 20:32
New changeset 6c8a117f8966 by Victor Stinner in branch 'default':
Issue #14744: Inline unicode_writer_write_char() and unicode_write_str()
http://hg.python.org/cpython/rev/6c8a117f8966
msg160503 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-13 02:26
To prepare a deeper change, here is a first simple patch. Change how the size of the _PyUnicodeWriter buffer is handled:

 * overallocate by 100% (instead of 25%) and allocate at least 100 characters
 * don't overallocate when possible

It is possible to not overallocate when the format string has no prefix nor suffix and when there is only one argument. For example, "%s" % "abc" does directly allocate the exact buffer size and so don't need to call realloc() at the end.

The patch does also micro-optimize (cleanup?) _PyUnicodeWriter_Finish.

When it's possible to not overallocate, the speed up is around 10% for short strings (I suppose that it's much better to longer strings).

--

Output of the attached benchmark.py script:

Linux-3.3.2-1.fc16.x86_64-x86_64-with-fedora-16-Verne
Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz

=== 3.3 ===

16 ms: N=200; fmt="{0}-"*N; arg="abc"; fmt.format(arg)
5.77 ms: N=200; L=3; fmt="%s"*N; args=("a"*L,)*N; fmt % args
648 ns: s="The {k1} is {k2} the {k3}."; args={"k1": "x", "k2": "y", "k3": "z"}; s.format(**args)
427 ns: s="The %(k1)s is %(k2)s the %(k3)s."; args={"k1":"x","k2":"y","k3":"z",}; s%args
531 ns: s="x={}, y={}, z={}"; args=(123, 456, 789); s.format(*args)
453 ns: s="x=%s, y=%u, z=%x"; args=(123, 456, 789); s%args
180 ns: fmt="{}"; arg="abc"; fmt.format(arg)
228 ns: fmt="{}"; arg=123; fmt.format(arg)
196 ns: fmt="x={}"; arg="abc"; fmt.format(arg)
244 ns: fmt="x={}"; arg=123; fmt.format(arg)
175 ns: str(12345)
233 ns: '{}'.format(12345)
243 ns: 'A{}'.format(12345)
276 ns: '\x80{}'.format(12345)
292 ns: '\u0100{}'.format(12345)
285 ns: '\U00010000{}'.format(12345)
417 ns: '{:-10}'.format(12345)
425 ns: 'A{:-10}'.format(12345)
461 ns: '\x80{:-10}'.format(12345)
488 ns: '\u0100{:-10}'.format(12345)
461 ns: '\U00010000{:-10}'.format(12345)
403 ns: '{:,}'.format(12345)
416 ns: 'A{:,}'.format(12345)
455 ns: '\x80{:,}'.format(12345)
469 ns: '\u0100{:,}'.format(12345)
448 ns: '\U00010000{:,}'.format(12345)
108 ns: fmt="%s"; arg="abc"; fmt % arg
163 ns: fmt="%s"; arg=123; fmt % arg
121 ns: fmt="%s:"; arg="abc"; fmt % arg

=== 3.3 + dont_overallocate.patch ===

15.5 ms: N=200; fmt="{0}-"*N; arg="abc"; fmt.format(arg)
 6 ms: N=200; L=3; fmt="%s"*N; args=("a"*L,)*N; fmt % args
599 ns: s="The {k1} is {k2} the {k3}."; args={"k1": "x", "k2": "y", "k3": "z"}; s.format(**args)
424 ns: s="The %(k1)s is %(k2)s the %(k3)s."; args={"k1":"x","k2":"y","k3":"z",}; s%args
531 ns: s="x={}, y={}, z={}"; args=(123, 456, 789); s.format(*args)
441 ns: s="x=%s, y=%u, z=%x"; args=(123, 456, 789); s%args
155 ns: fmt="{}"; arg="abc"; fmt.format(arg)
206 ns: fmt="{}"; arg=123; fmt.format(arg)
204 ns: fmt="x={}"; arg="abc"; fmt.format(arg)
247 ns: fmt="x={}"; arg=123; fmt.format(arg)
172 ns: str(12345)
209 ns: '{}'.format(12345)
248 ns: 'A{}'.format(12345)
257 ns: '\x80{}'.format(12345)
265 ns: '\u0100{}'.format(12345)
263 ns: '\U00010000{}'.format(12345)
354 ns: '{:-10}'.format(12345)
404 ns: 'A{:-10}'.format(12345)
415 ns: '\x80{:-10}'.format(12345)
457 ns: '\u0100{:-10}'.format(12345)
430 ns: '\U00010000{:-10}'.format(12345)
369 ns: '{:,}'.format(12345)
418 ns: 'A{:,}'.format(12345)
437 ns: '\x80{:,}'.format(12345)
459 ns: '\u0100{:,}'.format(12345)
448 ns: '\U00010000{:,}'.format(12345)
76 ns: fmt="%s"; arg="abc"; fmt % arg
133 ns: fmt="%s"; arg=123; fmt % arg
118 ns: fmt="%s:"; arg="abc"; fmt % arg
msg160513 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-13 09:56
> When it's possible to not overallocate, the speed up is around 10% for
> short strings (I suppose that it's much better to longer strings).

Well, please post a benchmark for long strings, then :-)
I think 10% on a micro-benchmark is not worth the complication. This
code is already complicated enough.
msg160565 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-13 20:10
>> When it's possible to not overallocate, the speed up is around 10% for
>> short strings (I suppose that it's much better to longer strings).
> Well, please post a benchmark for long strings, then :-)

Ok, here you have. I don't understand why results are so random. There
is no performance regression, but there are interesting speed up.

Original:

200 ns: L=10 ** 3; fmt="%s"; args="a" * L; fmt % args
370 ns: L=10 ** 4; fmt="%s"; args="a" * L; fmt % args
3.85 ms: L=10 ** 5; fmt="%s"; args="a" * L; fmt % args
334 ms: L=10 ** 6; fmt="%s"; args="a" * L; fmt % args
2.87 ms: L=10 ** 7; fmt="%s"; args="a" * L; fmt % args
23.9 ms: L=10 ** 8; fmt="%s"; args="a" * L; fmt % args

Patched:

120 ns (-40%): L=10 ** 3; fmt="%s"; args="a" * L; fmt % args
276 ns (-25%): L=10 ** 4; fmt="%s"; args="a" * L; fmt % args
3.7 ms (-4%): L=10 ** 5; fmt="%s"; args="a" * L; fmt % args
67.6 ms (-80%): L=10 ** 6; fmt="%s"; args="a" * L; fmt % args
1.38 ms (-51%): L=10 ** 7; fmt="%s"; args="a" * L; fmt % args
23.5 ms (-2%): L=10 ** 8; fmt="%s"; args="a" * L; fmt % args
msg160567 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-13 20:14
Not quite honest contrexample:

./python -m timeit -s "f='[{}]'.format;s='A'*100" "f(s)"

Python 3.3: 1000000 loops, best of 3: 1.67 usec per loop
Python 3.3 + dont_overallocate.patch: 100000 loops, best of 3: 2.01 usec per loop
msg160568 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-13 20:15
> >> When it's possible to not overallocate, the speed up is around 10% for
> >> short strings (I suppose that it's much better to longer strings).
> > Well, please post a benchmark for long strings, then :-)
> 
> Ok, here you have. I don't understand why results are so random. There
> is no performance regression, but there are interesting speed up.

Do you have anything more interesting than fmt="%s" ?
msg160574 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-13 20:41
> Not quite honest contrexample

I agree, this example is not honest :-) It's because of the magical value 100 used as initial size of the buffer. The speed is the same for shorter or longer strings.
msg160578 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-13 21:18
It seems to me that the proposed changes are too tricky and too dirty for such a modest gain. It seems to me, this effect can be achieved easier (special-casing "%s" and "{}" to return str(arg)?).

If you want to get really impressive results, try to compile a pattern into an optimized internal representation and cache it (as in the struct module). This completely different approach may work.
msg160579 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-13 21:26
> Do you have anything more interesting than fmt="%s" ?
and
> It seems to me that the proposed changes are too tricky and too dirty for such a modest gain.

To be honest, I didn't write dont_overallocate.patch to speed up formatting strings, but it's a patch to prepare another change on str.format. The patch limits the overhead of _PyUnicodeWriter for str.format.

I proposed the change as a separated patch because I prefer simple patches: they are easier to review and to benchmark. But I agree the speed up is less impressive :-) I also prefer small patches for practical reasons. It's not easy to exchange huge patches on the tracker and to manipulate them. I always hesitate to start a new HG repository for a specific issue.

I will rewrite my format_writer-2.patch based on dont_overallocate.patch. It looks like you are waiting for the full patch.
msg160581 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-13 21:32
> I will rewrite my format_writer-2.patch based on
> dont_overallocate.patch. It looks like you are waiting for the full
> patch.

Well, there's no point in committing the first patch if the second one
doesn't give an interesting speedup.
msg160610 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-14 12:05
I created a new repository to optimize str.format and str%args.
msg161459 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-23 21:20
Because I don't know what should be tested, I wrote a lot a tests in the bench_str.py script. To run the benchmark, use:

./python benchmark.py --file=FILE script bench_str.py

Then to compare results:

./python benchmark.py compare_to FILE1 FILE2 FILE3 ...

Download scripts from:
https://bitbucket.org/haypo/misc/src/tip/python/benchmark.py
https://bitbucket.org/haypo/misc/src/tip/python/bench_str.py

(I wrote the benchmark.py tool for this issue, it's a a generic tool to create and compare benchmarks.)

I attached results on 32 and 64 bits.

 * Python 2.7 vs 3.2 vs 3.3 (faster-format branch): REPORT_32BIT_2.7_3.2_writer and REPORT_64BIT_2.7_3.2_writer
 * Python 3.3, UCS4 buffer vs PyAccu vs _PyUnicodeWriter: REPORT_32BIT_3.3 and REPORT_64BIT_3.3
msg161460 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-23 21:26
When posting benchmark numbers, can you please only compared patched against unpatched? I don't think we care about performance compared to 3.2 or 2.7 here, and it would make things more readable.
msg161461 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-23 21:34
For Python 3.3, _PyUnicodeWriter API is faster than the Py_UCS4 buffer API and PyAccu API in quite all cases, with a speedup between 30% and 100%. But there are some cases where the _PyUnicodeWriter API is slower:

fmt="x={}"; arg=12.345; fmt.format(arg)
fmt="{}:"; arg=12.345; fmt.format(arg)
fmt="x=%s"; arg="\u20ac" * 3; fmt % arg
fmt="%s:"; arg="abc"; fmt % arg
fmt="%s:"; arg="\u20ac" * 3; fmt % arg
fmt="\u20ac[%s]"; arg="abc"; fmt % arg
fmt="\u20ac[%s]"; arg="\u20ac" * 3; fmt % arg
fmt="\u20ac[%s]"; arg=12.345; fmt % arg
fmt="\u20ac[%s]"; arg=2j; fmt % arg
msg161462 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-23 21:38
> When posting benchmark numbers, can you please only compared
> patched against unpatched?

Here you have: REPORT_64BIT_PATCH.
msg161464 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-23 21:50
faster-format.patch: Patch for Python 3.3 optimizing str%args and str.format(args), use _PyUnicodeWriter deeper in formatting. The patch uses different optimizations:

* if the result is just a string, copy the string by reference, don't copy it by value. It's not something new, this optimization was already used by the PyAccu API. Examples:

 - "{}".format(str)
 - "%s".format(str)

* avoid a temporary buffer to format integers (base 2, 8, 10, 16). Examples:

 - "decimal=%s".format(int)
 - "hex=%x".format(int)
 - "%o".format(int)
 - "{}".format(int)
 - "{:x}".format(int)

* don't overallocate the last argument of a format string. Example:

 - "x=%s".format("A" * 4096)
msg161484 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-24 07:59
> For Python 3.3, _PyUnicodeWriter API is faster than the Py_UCS4 buffer API and PyAccu API in quite all cases, with a speedup between 30% and 100%. But there are some cases where the _PyUnicodeWriter API is slower:

Perhaps most of these problems can be solved if instead of the boolean
flag (overallocate/no overallocate) to use the Py_ssize_t parameter that
indicates by how much should you overallocate (it is the length of the
suffix in the format).
msg161494 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-24 09:57
>> For Python 3.3, _PyUnicodeWriter API is faster than the Py_UCS4 buffer API and PyAccu API in quite all cases, with a speedup between 30% and 100%. But there are some cases where the _PyUnicodeWriter API is slower:
>
> Perhaps most of these problems can be solved if instead of the boolean
> flag (overallocate/no overallocate) to use the Py_ssize_t parameter that
> indicates by how much should you overallocate (it is the length of the
> suffix in the format).

There is not only a flag (flags.overallocate): there is also the
min_length, which is used and helps for str%args and str.format(args).

My patch contains a lot of "tricks" to limit overallocation, e.g.
don't overallocate if we are writing the last part of the output.

Computing exactly the size of the buffer gives the best performance
because it avoids a resize in _PyUnicodeWriter_Finish(). I tried for
example to modify PyUnicode_Format() to parse the format string twice:
first to compute the size of the output buffer, second to write
characters. In my experience, parsing the format string twice is more
expensive than reallocating the buffer (PyUnicode_READ is expensive),
especially on short and simple format strings.

I tried different methods to allocate the buffer of _PyUnicodeWriter:
change the overallocation factor (0%, 25%, 50%, 100%), only
overallocate +100 characters, etc. But I failed to find something
better than the proposed patch.

At least I can say than always disabling overallocation slows done
many cases: when there is a suffix after an argument, or when there
are more than one argument.

Feel free to experiment other methods to estimate the size of the output buffer.
msg161784 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-28 14:01
So, do you have any comment or complain? Or can I commit the patch?
Le 24 mai 2012 11:57, "STINNER Victor" <report@bugs.python.org> a écrit :

>
> STINNER Victor <victor.stinner@gmail.com> added the comment:
>
> >> For Python 3.3, _PyUnicodeWriter API is faster than the Py_UCS4 buffer
> API and PyAccu API in quite all cases, with a speedup between 30% and 100%.
> But there are some cases where the _PyUnicodeWriter API is slower:
> >
> > Perhaps most of these problems can be solved if instead of the boolean
> > flag (overallocate/no overallocate) to use the Py_ssize_t parameter that
> > indicates by how much should you overallocate (it is the length of the
> > suffix in the format).
>
> There is not only a flag (flags.overallocate): there is also the
> min_length, which is used and helps for str%args and str.format(args).
>
> My patch contains a lot of "tricks" to limit overallocation, e.g.
> don't overallocate if we are writing the last part of the output.
>
> Computing exactly the size of the buffer gives the best performance
> because it avoids a resize in _PyUnicodeWriter_Finish(). I tried for
> example to modify PyUnicode_Format() to parse the format string twice:
> first to compute the size of the output buffer, second to write
> characters. In my experience, parsing the format string twice is more
> expensive than reallocating the buffer (PyUnicode_READ is expensive),
> especially on short and simple format strings.
>
> I tried different methods to allocate the buffer of _PyUnicodeWriter:
> change the overallocation factor (0%, 25%, 50%, 100%), only
> overallocate +100 characters, etc. But I failed to find something
> better than the proposed patch.
>
> At least I can say than always disabling overallocation slows done
> many cases: when there is a suffix after an argument, or when there
> are more than one argument.
>
> Feel free to experiment other methods to estimate the size of the output
> buffer.
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue14744>
> _______________________________________
>
msg161789 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-28 15:22
> So, do you have any comment or complain? Or can I commit the patch?

I beg your pardon, I will do a review and additional benchmarks today.

So far away I have to say, it is better to use stringlib approach, than the 
massive macros, which are more difficult to read and edit. However, I will do a 
benchmark to check if we can achieve the same effect with less change code.
msg161810 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-28 21:44
> So far away I have to say, it is better to use stringlib
> approach, than the massive macros, which are more difficult
> to read and edit.

Ah, you don't like the two macros in longobject.c. Functions to write digits into a string may be appropriate in the stringlib.
msg161813 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-28 22:13
I just sent you a patch which does not use any macros or stringlib.
msg161814 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-28 23:00
> Functions to write digits into a string may be appropriate
> in the stringlib.

Oh, stringlib is specific to unicodeobject.c: it cannot be used outside.
msg161870 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-05-29 11:13
New changeset 22b56b0b8619 by Victor Stinner in branch 'default':
Issue #14744: Use the new _PyUnicodeWriter internal API to speed up str%args and str.format(args)
http://hg.python.org/cpython/rev/22b56b0b8619
msg161899 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-05-29 16:52
New changeset 6abab1a103a6 by Victor Stinner in branch 'default':
Issue #14744: Fix compilation on Windows
http://hg.python.org/cpython/rev/6abab1a103a6
msg161900 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-05-29 16:54
New changeset df0144f68d76 by Victor Stinner in branch 'default':
Issue #14744: Fix compilation on Windows (part 2)
http://hg.python.org/cpython/rev/df0144f68d76
msg161916 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-29 22:35
report_windows7: Comparaison of str%args and str.format() on Windows 7.
 * Python 2.7 (64 bits)
 * Python 3.2 (64 bits), narrow (UTF-16)
 * Python 3.3 (*32* bits), PEP 393

The benchmark is not fair because Python 3.3 is compiled in 32 bits, but there are interesting results. Example: ("{0}"*1024).format("A" * 4096) takes 1 or 2 seconds with Python 2.7/3.2 and 0.008 sec with Python 3.3!
History
Date User Action Args
2022-04-11 14:57:29adminsetgithub: 58949
2012-05-29 23:15:40vstinnersetstatus: open -> closed
resolution: fixed
2012-05-29 22:35:18vstinnersetfiles: + report_windows7

messages: + msg161916
2012-05-29 16:54:02python-devsetmessages: + msg161900
2012-05-29 16:52:11python-devsetmessages: + msg161899
2012-05-29 11:13:24python-devsetmessages: + msg161870
2012-05-28 23:00:02vstinnersetmessages: + msg161814
2012-05-28 22:13:29serhiy.storchakasetmessages: + msg161813
2012-05-28 21:44:37vstinnersetmessages: + msg161810
2012-05-28 15:22:12serhiy.storchakasetmessages: + msg161789
2012-05-28 14:01:35vstinnersetmessages: + msg161784
2012-05-24 09:57:29vstinnersetmessages: + msg161494
2012-05-24 07:59:01serhiy.storchakasetmessages: + msg161484
2012-05-23 21:50:47vstinnersetfiles: + faster-format.patch

messages: + msg161464
2012-05-23 21:48:00vstinnersetfiles: - faa88c50a3d2.diff
2012-05-23 21:38:06vstinnersetfiles: + REPORT_64BIT_PATCH

messages: + msg161462
2012-05-23 21:34:59vstinnersetmessages: + msg161461
2012-05-23 21:26:37pitrousetmessages: + msg161460
2012-05-23 21:20:35vstinnersetmessages: + msg161459
2012-05-23 21:12:47vstinnersetfiles: + faa88c50a3d2.diff
2012-05-23 21:11:05vstinnersetfiles: + REPORT_64BIT_3.3
2012-05-23 21:10:54vstinnersetfiles: + REPORT_64BIT_2.7_3.2_writer
2012-05-23 21:10:42vstinnersetfiles: + REPORT_32BIT_3.3
2012-05-23 21:10:27vstinnersetfiles: + REPORT_32BIT_2.7_3.2_writer
2012-05-14 12:05:13vstinnersethgrepos: + hgrepo125
messages: + msg160610
2012-05-13 21:32:26pitrousetmessages: + msg160581
2012-05-13 21:26:53vstinnersetmessages: + msg160579
2012-05-13 21:18:26serhiy.storchakasetmessages: + msg160578
2012-05-13 20:41:06vstinnersetmessages: + msg160574
2012-05-13 20:15:42pitrousetmessages: + msg160568
2012-05-13 20:14:33serhiy.storchakasetmessages: + msg160567
2012-05-13 20:10:28vstinnersetmessages: + msg160565
2012-05-13 09:56:38pitrousetmessages: + msg160513
2012-05-13 02:27:04vstinnersetfiles: + benchmark.py
2012-05-13 02:26:48vstinnersetfiles: + dont_overallocate.patch

messages: + msg160503
2012-05-09 20:32:53python-devsetnosy: + python-dev
messages: + msg160313
2012-05-09 11:53:45vstinnersetfiles: + inline_unicode_writer.patch

messages: + msg160277
2012-05-09 08:03:21serhiy.storchakasetfiles: + issue14744-bench-1.py

messages: + msg160259
2012-05-09 05:38:23pitrousetmessages: + msg160244
2012-05-08 23:58:41vstinnersetfiles: + format_writer-2.patch

messages: + msg160236
2012-05-08 10:38:52serhiy.storchakasetmessages: + msg160194
2012-05-08 09:02:13mark.dickinsonsetnosy: + mark.dickinson
messages: + msg160190
2012-05-08 08:37:05vstinnersetmessages: + msg160189
2012-05-08 05:32:09serhiy.storchakasetmessages: + msg160186
2012-05-07 22:19:41vstinnersetmessages: + msg160177
2012-05-07 22:09:20vstinnercreate