Issue 14744: Use _PyUnicodeWriter API in str.format() internals

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/58949

classification

Title:	Use _PyUnicodeWriter API in str.format() internals
Type:		Stage:
Components:	Interpreter Core	Versions:	Python 3.3

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	loewis, mark.dickinson, pitrou, python-dev, serhiy.storchaka, vstinner
Priority:	normal	Keywords:	patch

Created on 2012-05-07 22:09 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
format_writer.patch	vstinner, 2012-05-07 22:09		review
format_writer-2.patch	vstinner, 2012-05-08 23:58		review
issue14744-bench-1.py	serhiy.storchaka, 2012-05-09 08:03
inline_unicode_writer.patch	vstinner, 2012-05-09 11:53		review
dont_overallocate.patch	vstinner, 2012-05-13 02:26		review
benchmark.py	vstinner, 2012-05-13 02:27
REPORT_32BIT_2.7_3.2_writer	vstinner, 2012-05-23 21:10
REPORT_32BIT_3.3	vstinner, 2012-05-23 21:10
REPORT_64BIT_2.7_3.2_writer	vstinner, 2012-05-23 21:10
REPORT_64BIT_3.3	vstinner, 2012-05-23 21:11
REPORT_64BIT_PATCH	vstinner, 2012-05-23 21:38
faster-format.patch	vstinner, 2012-05-23 21:50		review
report_windows7	vstinner, 2012-05-29 22:35

Repositories containing patches
http://hg.python.org/features/faster-format/

Messages (37)
msg160176 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-07 22:09
Since 7be716a47e9d (issue #14716), str.format() uses the "unicode_writer" API. I propose to continue the work in this direction to avoid more temporary buffers. Python 3.3: 1000000 loops, best of 3: 0.573 usec per loop 100000 loops, best of 3: 16.4 usec per loop 1000000 loops, best of 3: 0.705 usec per loop 100000 loops, best of 3: 2.61 usec per loop Python 3.3 + patch (compared to Python 3.3): 1000000 loops, best of 3: 0.516 usec per loop (-10%) 100000 loops, best of 3: 13.2 usec per loop (-20%) 1000000 loops, best of 3: 0.574 usec per loop (-18%) 100000 loops, best of 3: 2.59 usec per loop (-1%) -- If this patch is accepted, it's more to go even deeper: use _PyUnicodeWriter in long_to_decimal_string() for example. -- Benchmark Python 3 / Python 2 bytes: python -m timeit -s 'fmt="{0}.{1}.{2}"' 'fmt.format("http", "client", "HTTPConnection")' python -m timeit -s 'fmt="{0:s}"100' 'fmt.format("ABCDEF")' python -m timeit -s 'fmt=" [line {0:2d}] "' 'fmt.format(5)' python -m timeit -s 'fmt="x={} y={} z={}"' 'fmt.format(12345, 12.345, 12.345+2j)' Benchmark Python 2 unicode: python -m timeit -s 'fmt=u"{0}.{1}.{2}"' 'fmt.format(u"http", u"client", u"HTTPConnection")' python -m timeit -s 'fmt=u"{0:s}"100' 'fmt.format(u"ABCDEF")' python -m timeit -s 'fmt=u" [line {0:2d}] "' 'fmt.format(5)' python -m timeit -s 'fmt=u"x={} y={} z={}"' 'fmt.format(12345, 12.345, 12.345+2j)' Python 2.7 bytes: 1000000 loops, best of 3: 0.393 usec per loop 100000 loops, best of 3: 9.72 usec per loop 1000000 loops, best of 3: 0.337 usec per loop 1000000 loops, best of 3: 1.56 usec per loop Python 2.7 wide: 1000000 loops, best of 3: 0.443 usec per loop 100000 loops, best of 3: 10.3 usec per loop 1000000 loops, best of 3: 0.785 usec per loop 100000 loops, best of 3: 2.48 usec per loop Python 3.2 wide: 1000000 loops, best of 3: 0.457 usec per loop 100000 loops, best of 3: 10.5 usec per loop 1000000 loops, best of 3: 0.538 usec per loop 100000 loops, best of 3: 2.36 usec per loop
msg160177 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-07 22:19
Comments on the patch. -PyAPI_FUNC(PyObject ) _PyComplex_FormatAdvanced(PyObject obj, +PyAPI_FUNC(int) _PyComplex_FormatWriter(PyObject obj, Even if it is a private function, I prefer to rename it because its API does change. / Use the inlined version in unicodeobject.c */ #define _PyUnicodeWriter_prepare _PyUnicodeWriter_prepare_inline #define _PyUnicodeWriter_write_substr _PyUnicodeWriter_write_substr_inline #define _PyUnicodeWriter_write_str _PyUnicodeWriter_write_str_inline #define _PyUnicodeWriter_write_char _PyUnicodeWriter_write_char_inline Inlining may be removed to simplify the code (but inlining does speed up the code a little bit). Or the opposite: this code should be moved to a new "unicodewriterinline.h" file which would be included by unicodeobject.c and formatter_unicode.c.
msg160186 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-08 05:32
> If this patch is accepted, it's more to go even deeper: use _PyUnicodeWriter in long_to_decimal_string() for example. long_to_decimal_string() is already creates a string of known size. How _PyUnicodeWriter can help here? Issue3451 looks much more promising for int formatting. But it will take a lot of time to carefully check this.
msg160189 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-08 08:37
_PyUnicodeWriter in long_to_decimal_string() for example. > > long_to_decimal_string() is already creates a string of known size. How > _PyUnicodeWriter can help here? "x={}".format(123) uses a temporary buffer for "123". Using _PyUnicodeWriter even to format 123 would avoid a malloc() and a copy of the characters.
msg160190 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2012-05-08 09:02
> Issue3451 looks much more promising for int formatting. But it will take > a lot of time to carefully check this. I disagree: Issue 3451 is about asymptotically fast base conversion, and the changes proposed there are only going to kick in for numbers with hundreds of digits; it's not going to affect the common case at all.
msg160194 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-08 10:38
> "x={}".format(123) uses a temporary buffer for "123". This, apparently, is inevitable. I doubt that it is able to considerably optimize, not worsened str(int) (which is optimal for current algorithm). Note that the more complex formatting (with width) will still require the temporary buffer. Be very careful not to cause regress. > Using > _PyUnicodeWriter even to format 123 would avoid a malloc() and a copy of > the characters. Fill the ascii buffer and then copying can be cheaper than using _PyUnicodeWriter with general non-ascii string.
msg160236 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-08 23:58
> Fill the ascii buffer and then copying can be cheaper than using > _PyUnicodeWriter with general non-ascii string. Here is a new patch using _PyUnicodeWriter directly in longobject.c. According to my benchmark (see below), formating a small number (5 decimal digits) is 17% faster with my patch version 2 compared to tip, and 38% faster compared to Python 3.3 before my optimizations on str%tuples or str.format(). Creating a temporary PyUnicode is not cheap, at least for short strings. str%tuple and str.format() allocates len(format_string)+100 ASCII characters at the beginning, which is enough for "x={}".format(12345) for example. So only a resize is needed, and it looks like resizing is cheap. I'm not completly satisfied of the usage of Py_LOCAL_INLINE in unicodeobject.c for _PyUnicodeWriter methods. The same "hacks" (?) should be used in formatter_unicode.c. Shell script (bench.sh) used to benchmark: -------- echo -n "{0}.{1}.{2}: "; ./python -m timeit -r 10 -s 'fmt="{0}.{1}.{2}"' 'fmt.format("http", "client", "HTTPConnection")' echo -n " [line {0:2d}] : "; ./python -m timeit -r 10 -s 'fmt=" [line {0:2d}] "' 'fmt.format(5)' echo -n "str: "; ./python -m timeit -r 10 -s 'fmt="{0}"100' 'fmt.format("ABCDEF")' echo -n "str conv: "; ./python -m timeit -r 10 -s 'fmt="{0:s}"100' 'fmt.format("ABCDEF")' echo -n "long x 3: "; ./python -m timeit -r 10 -s 'fmt="x={0} x={0} x={0}"' 'fmt.format(12345)' echo -n "float x 3: "; ./python -m timeit -r 10 -s 'fmt="x={0} x={0} x={0}"' 'fmt.format(12.345)' echo -n "complex x 3: "; ./python -m timeit -r 10 -s 'fmt="x={0} x={0} x={0}"' 'fmt.format(12.345+2j)' echo -n "long, float, complex: "; ./python -m timeit -r 10 -s 'fmt="x={} y={} z={}"' 'fmt.format(12345, 12.345, 12.345+2j)' echo -n "huge long: "; ./python -m timeit -r 10 -s 'import math; huge=math.factorial(2000); fmt="x={}"' 'fmt.format(huge)' -------- Results: -------- 3.3: {0}.{1}.{2}: 1000000 loops, best of 10: 0.394 usec per loop [line {0:2d}] : 1000000 loops, best of 10: 0.519 usec per loop str: 100000 loops, best of 10: 7.01 usec per loop str conv: 100000 loops, best of 10: 13.3 usec per loop long x 3: 1000000 loops, best of 10: 0.569 usec per loop float x 3: 1000000 loops, best of 10: 1.62 usec per loop complex x 3: 100000 loops, best of 10: 3.34 usec per loop long, float, complex: 100000 loops, best of 10: 2.08 usec per loop huge long: 1000 loops, best of 10: 666 usec per loop 3.3 + format_writer.patch : {0}.{1}.{2}: 1000000 loops, best of 10: 0.412 usec per loop (+5%) [line {0:2d}] : 1000000 loops, best of 10: 0.461 usec per loop (-11%) str: 100000 loops, best of 10: 6.85 usec per loop (-2%) str conv: 100000 loops, best of 10: 11.1 usec per loop (-17%) long x 3: 1000000 loops, best of 10: 0.605 usec per loop (+6%) float x 3: 1000000 loops, best of 10: 1.57 usec per loop (-3%) complex x 3: 100000 loops, best of 10: 3.54 usec per loop (+6%) long, float, complex: 100000 loops, best of 10: 2.19 usec per loop (+5%) huge long: 1000 loops, best of 10: 665 usec per loop (0%) 3.3 + format_writer-2.patch : {0}.{1}.{2}: 1000000 loops, best of 10: 0.378 usec per loop (-4%) [line {0:2d}] : 1000000 loops, best of 10: 0.454 usec per loop (-13%) str: 100000 loops, best of 10: 6.18 usec per loop (-12%) str conv: 100000 loops, best of 10: 10.9 usec per loop (-18%) long x 3: 1000000 loops, best of 10: 0.471 usec per loop (-17%) float x 3: 1000000 loops, best of 10: 1.37 usec per loop (-15%) complex x 3: 100000 loops, best of 10: 3.4 usec per loop (+2%) long, float, complex: 1000000 loops, best of 10: 1.93 usec per loop (-7%) huge long: 1000 loops, best of 10: 665 usec per loop (0%) --------
msg160244 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-05-09 05:38
> According to my benchmark (see below), formating a small number (5 > decimal digits) is 17% faster with my patch version 2 compared to tip, > and 38% faster compared to Python 3.3 before my optimizations on str% > tuples or str.format(). Creating a temporary PyUnicode is not cheap, > at least for short strings. A 17% improvement on a micro-benchmark is not much. There will probably be no visible difference in real-world code. Also, if creating temporary PyUnicodes is not cheap, perhaps we could have a freelist for them?
msg160259 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-09 08:03
> Here is a new patch using _PyUnicodeWriter directly in longobject.c. It may be worth to do it in a separate issue? decimal digits) is 17% faster with my patch version 2 compared to tip, and 38% faster compared to Python 3.3 before my optimizations on str% tuples or str.format(). Creating a temporary PyUnicode is not cheap, at least for short strings. Here is my benchmark script (attached) and the results: Python 3.3 (vanilla): 1.65 str(12345) 2.6 '{}'.format(12345) 2.69 'A{}'.format(12345) 3.16 '\x80{}'.format(12345) 3.23 '\u0100{}'.format(12345) 3.32 '\U00010000{}'.format(12345) 4.6 '{:-10}'.format(12345) 4.89 'A{:-10}'.format(12345) 5.53 '\x80{:-10}'.format(12345) 5.71 '\u0100{:-10}'.format(12345) 5.63 '\U00010000{:-10}'.format(12345) 4.6 '{:,}'.format(12345) 4.71 'A{:,}'.format(12345) 5.28 '\x80{:,}'.format(12345) 5.65 '\u0100{:,}'.format(12345) 5.59 '\U00010000{:,}'.format(12345) Python 3.3 + format_writer.patch: 1.72 str(12345) 2.74 '{}'.format(12345) 2.99 'A{}'.format(12345) 3.4 '\x80{}'.format(12345) 3.52 '\u0100{}'.format(12345) 3.51 '\U00010000{}'.format(12345) 4.24 '{:-10}'.format(12345) 4.6 'A{:-10}'.format(12345) 5.16 '\x80{:-10}'.format(12345) 6.87 '\u0100{:-10}'.format(12345) 6.83 '\U00010000{:-10}'.format(12345) 4.12 '{:,}'.format(12345) 4.6 'A{:,}'.format(12345) 5.09 '\x80{:,}'.format(12345) 6.63 '\u0100{:,}'.format(12345) 6.42 '\U00010000{:,}'.format(12345) Python 3.3 + format_writer-2.patch: 1.91 str(12345) 2.44 '{}'.format(12345) 2.61 'A{}'.format(12345) 3.08 '\x80{}'.format(12345) 3.31 '\u0100{}'.format(12345) 3.13 '\U00010000{}'.format(12345) 4.57 '{:-10}'.format(12345) 4.96 'A{:-10}'.format(12345) 5.52 '\x80{:-10}'.format(12345) 7.01 '\u0100{:-10}'.format(12345) 7.34 '\U00010000{:-10}'.format(12345) 4.42 '{:,}'.format(12345) 4.76 'A{:,}'.format(12345) 5.16 '\x80{:,}'.format(12345) 7.2 '\u0100{:,}'.format(12345) 6.74 '\U00010000{:,}'.format(12345) As you can see, there is a regress, and sometimes it is not less than improvement.
msg160277 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-09 11:53
> Inlining may be removed to simplify the code Attached inline_unicode_writer.patch does inline the code but also call only unicode_writer_prepare() once for each argument in PyUnicode_Format(). The patch removes unicode_writer_write_char() and unicode_writer_write_str() which had an overhead for the following patches (format_writer.patch, format_writer-2.patch). > As you can see, there is a regress, and sometimes > it is not less than improvement. Oh yes, thanks for your benchmark. I will analyze why my patch slows down these cases and try to update my patch to be applicable after inline_unicode_writer.patch. I suppose that the _PyUnicodeWriter API has a little overhead which is seen by microbenchmarks.
msg160313 - (view)	Author: Roundup Robot (python-dev)	Date: 2012-05-09 20:32
New changeset 6c8a117f8966 by Victor Stinner in branch 'default': Issue #14744: Inline unicode_writer_write_char() and unicode_write_str() http://hg.python.org/cpython/rev/6c8a117f8966
msg160503 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-13 02:26
To prepare a deeper change, here is a first simple patch. Change how the size of the _PyUnicodeWriter buffer is handled: * overallocate by 100% (instead of 25%) and allocate at least 100 characters * don't overallocate when possible It is possible to not overallocate when the format string has no prefix nor suffix and when there is only one argument. For example, "%s" % "abc" does directly allocate the exact buffer size and so don't need to call realloc() at the end. The patch does also micro-optimize (cleanup?) _PyUnicodeWriter_Finish. When it's possible to not overallocate, the speed up is around 10% for short strings (I suppose that it's much better to longer strings). -- Output of the attached benchmark.py script: Linux-3.3.2-1.fc16.x86_64-x86_64-with-fedora-16-Verne Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz === 3.3 === 16 ms: N=200; fmt="{0}-"N; arg="abc"; fmt.format(arg) 5.77 ms: N=200; L=3; fmt="%s"N; args=("a"L,)N; fmt % args 648 ns: s="The {k1} is {k2} the {k3}."; args={"k1": "x", "k2": "y", "k3": "z"}; s.format(*args) 427 ns: s="The %(k1)s is %(k2)s the %(k3)s."; args={"k1":"x","k2":"y","k3":"z",}; s%args 531 ns: s="x={}, y={}, z={}"; args=(123, 456, 789); s.format(args) 453 ns: s="x=%s, y=%u, z=%x"; args=(123, 456, 789); s%args 180 ns: fmt="{}"; arg="abc"; fmt.format(arg) 228 ns: fmt="{}"; arg=123; fmt.format(arg) 196 ns: fmt="x={}"; arg="abc"; fmt.format(arg) 244 ns: fmt="x={}"; arg=123; fmt.format(arg) 175 ns: str(12345) 233 ns: '{}'.format(12345) 243 ns: 'A{}'.format(12345) 276 ns: '\x80{}'.format(12345) 292 ns: '\u0100{}'.format(12345) 285 ns: '\U00010000{}'.format(12345) 417 ns: '{:-10}'.format(12345) 425 ns: 'A{:-10}'.format(12345) 461 ns: '\x80{:-10}'.format(12345) 488 ns: '\u0100{:-10}'.format(12345) 461 ns: '\U00010000{:-10}'.format(12345) 403 ns: '{:,}'.format(12345) 416 ns: 'A{:,}'.format(12345) 455 ns: '\x80{:,}'.format(12345) 469 ns: '\u0100{:,}'.format(12345) 448 ns: '\U00010000{:,}'.format(12345) 108 ns: fmt="%s"; arg="abc"; fmt % arg 163 ns: fmt="%s"; arg=123; fmt % arg 121 ns: fmt="%s:"; arg="abc"; fmt % arg === 3.3 + dont_overallocate.patch === 15.5 ms: N=200; fmt="{0}-"N; arg="abc"; fmt.format(arg) 6 ms: N=200; L=3; fmt="%s"N; args=("a"L,)N; fmt % args 599 ns: s="The {k1} is {k2} the {k3}."; args={"k1": "x", "k2": "y", "k3": "z"}; s.format(*args) 424 ns: s="The %(k1)s is %(k2)s the %(k3)s."; args={"k1":"x","k2":"y","k3":"z",}; s%args 531 ns: s="x={}, y={}, z={}"; args=(123, 456, 789); s.format(args) 441 ns: s="x=%s, y=%u, z=%x"; args=(123, 456, 789); s%args 155 ns: fmt="{}"; arg="abc"; fmt.format(arg) 206 ns: fmt="{}"; arg=123; fmt.format(arg) 204 ns: fmt="x={}"; arg="abc"; fmt.format(arg) 247 ns: fmt="x={}"; arg=123; fmt.format(arg) 172 ns: str(12345) 209 ns: '{}'.format(12345) 248 ns: 'A{}'.format(12345) 257 ns: '\x80{}'.format(12345) 265 ns: '\u0100{}'.format(12345) 263 ns: '\U00010000{}'.format(12345) 354 ns: '{:-10}'.format(12345) 404 ns: 'A{:-10}'.format(12345) 415 ns: '\x80{:-10}'.format(12345) 457 ns: '\u0100{:-10}'.format(12345) 430 ns: '\U00010000{:-10}'.format(12345) 369 ns: '{:,}'.format(12345) 418 ns: 'A{:,}'.format(12345) 437 ns: '\x80{:,}'.format(12345) 459 ns: '\u0100{:,}'.format(12345) 448 ns: '\U00010000{:,}'.format(12345) 76 ns: fmt="%s"; arg="abc"; fmt % arg 133 ns: fmt="%s"; arg=123; fmt % arg 118 ns: fmt="%s:"; arg="abc"; fmt % arg
msg160513 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-05-13 09:56
> When it's possible to not overallocate, the speed up is around 10% for > short strings (I suppose that it's much better to longer strings). Well, please post a benchmark for long strings, then :-) I think 10% on a micro-benchmark is not worth the complication. This code is already complicated enough.
msg160565 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-13 20:10
>> When it's possible to not overallocate, the speed up is around 10% for >> short strings (I suppose that it's much better to longer strings). > Well, please post a benchmark for long strings, then :-) Ok, here you have. I don't understand why results are so random. There is no performance regression, but there are interesting speed up. Original: 200 ns: L=10 ** 3; fmt="%s"; args="a" * L; fmt % args 370 ns: L=10 ** 4; fmt="%s"; args="a" * L; fmt % args 3.85 ms: L=10 ** 5; fmt="%s"; args="a" * L; fmt % args 334 ms: L=10 ** 6; fmt="%s"; args="a" * L; fmt % args 2.87 ms: L=10 ** 7; fmt="%s"; args="a" * L; fmt % args 23.9 ms: L=10 ** 8; fmt="%s"; args="a" * L; fmt % args Patched: 120 ns (-40%): L=10 ** 3; fmt="%s"; args="a" * L; fmt % args 276 ns (-25%): L=10 ** 4; fmt="%s"; args="a" * L; fmt % args 3.7 ms (-4%): L=10 ** 5; fmt="%s"; args="a" * L; fmt % args 67.6 ms (-80%): L=10 ** 6; fmt="%s"; args="a" * L; fmt % args 1.38 ms (-51%): L=10 ** 7; fmt="%s"; args="a" * L; fmt % args 23.5 ms (-2%): L=10 ** 8; fmt="%s"; args="a" * L; fmt % args
msg160567 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-13 20:14
Not quite honest contrexample: ./python -m timeit -s "f='[{}]'.format;s='A'*100" "f(s)" Python 3.3: 1000000 loops, best of 3: 1.67 usec per loop Python 3.3 + dont_overallocate.patch: 100000 loops, best of 3: 2.01 usec per loop
msg160568 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-05-13 20:15
> >> When it's possible to not overallocate, the speed up is around 10% for > >> short strings (I suppose that it's much better to longer strings). > > Well, please post a benchmark for long strings, then :-) > > Ok, here you have. I don't understand why results are so random. There > is no performance regression, but there are interesting speed up. Do you have anything more interesting than fmt="%s" ?
msg160574 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-13 20:41
> Not quite honest contrexample I agree, this example is not honest :-) It's because of the magical value 100 used as initial size of the buffer. The speed is the same for shorter or longer strings.
msg160578 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-13 21:18
It seems to me that the proposed changes are too tricky and too dirty for such a modest gain. It seems to me, this effect can be achieved easier (special-casing "%s" and "{}" to return str(arg)?). If you want to get really impressive results, try to compile a pattern into an optimized internal representation and cache it (as in the struct module). This completely different approach may work.
msg160579 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-13 21:26
> Do you have anything more interesting than fmt="%s" ? and > It seems to me that the proposed changes are too tricky and too dirty for such a modest gain. To be honest, I didn't write dont_overallocate.patch to speed up formatting strings, but it's a patch to prepare another change on str.format. The patch limits the overhead of _PyUnicodeWriter for str.format. I proposed the change as a separated patch because I prefer simple patches: they are easier to review and to benchmark. But I agree the speed up is less impressive :-) I also prefer small patches for practical reasons. It's not easy to exchange huge patches on the tracker and to manipulate them. I always hesitate to start a new HG repository for a specific issue. I will rewrite my format_writer-2.patch based on dont_overallocate.patch. It looks like you are waiting for the full patch.
msg160581 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-05-13 21:32
> I will rewrite my format_writer-2.patch based on > dont_overallocate.patch. It looks like you are waiting for the full > patch. Well, there's no point in committing the first patch if the second one doesn't give an interesting speedup.
msg160610 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-14 12:05
I created a new repository to optimize str.format and str%args.
msg161459 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-23 21:20
Because I don't know what should be tested, I wrote a lot a tests in the bench_str.py script. To run the benchmark, use: ./python benchmark.py --file=FILE script bench_str.py Then to compare results: ./python benchmark.py compare_to FILE1 FILE2 FILE3 ... Download scripts from: https://bitbucket.org/haypo/misc/src/tip/python/benchmark.py https://bitbucket.org/haypo/misc/src/tip/python/bench_str.py (I wrote the benchmark.py tool for this issue, it's a a generic tool to create and compare benchmarks.) I attached results on 32 and 64 bits. * Python 2.7 vs 3.2 vs 3.3 (faster-format branch): REPORT_32BIT_2.7_3.2_writer and REPORT_64BIT_2.7_3.2_writer * Python 3.3, UCS4 buffer vs PyAccu vs _PyUnicodeWriter: REPORT_32BIT_3.3 and REPORT_64BIT_3.3
msg161460 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-05-23 21:26
When posting benchmark numbers, can you please only compared patched against unpatched? I don't think we care about performance compared to 3.2 or 2.7 here, and it would make things more readable.
msg161461 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-23 21:34
For Python 3.3, _PyUnicodeWriter API is faster than the Py_UCS4 buffer API and PyAccu API in quite all cases, with a speedup between 30% and 100%. But there are some cases where the _PyUnicodeWriter API is slower: fmt="x={}"; arg=12.345; fmt.format(arg) fmt="{}:"; arg=12.345; fmt.format(arg) fmt="x=%s"; arg="\u20ac" * 3; fmt % arg fmt="%s:"; arg="abc"; fmt % arg fmt="%s:"; arg="\u20ac" * 3; fmt % arg fmt="\u20ac[%s]"; arg="abc"; fmt % arg fmt="\u20ac[%s]"; arg="\u20ac" * 3; fmt % arg fmt="\u20ac[%s]"; arg=12.345; fmt % arg fmt="\u20ac[%s]"; arg=2j; fmt % arg
msg161462 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-23 21:38
> When posting benchmark numbers, can you please only compared > patched against unpatched? Here you have: REPORT_64BIT_PATCH.
msg161464 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-23 21:50
faster-format.patch: Patch for Python 3.3 optimizing str%args and str.format(args), use _PyUnicodeWriter deeper in formatting. The patch uses different optimizations: * if the result is just a string, copy the string by reference, don't copy it by value. It's not something new, this optimization was already used by the PyAccu API. Examples: - "{}".format(str) - "%s".format(str) * avoid a temporary buffer to format integers (base 2, 8, 10, 16). Examples: - "decimal=%s".format(int) - "hex=%x".format(int) - "%o".format(int) - "{}".format(int) - "{:x}".format(int) * don't overallocate the last argument of a format string. Example: - "x=%s".format("A" * 4096)
msg161484 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-24 07:59
> For Python 3.3, _PyUnicodeWriter API is faster than the Py_UCS4 buffer API and PyAccu API in quite all cases, with a speedup between 30% and 100%. But there are some cases where the _PyUnicodeWriter API is slower: Perhaps most of these problems can be solved if instead of the boolean flag (overallocate/no overallocate) to use the Py_ssize_t parameter that indicates by how much should you overallocate (it is the length of the suffix in the format).
msg161494 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-24 09:57
>> For Python 3.3, _PyUnicodeWriter API is faster than the Py_UCS4 buffer API and PyAccu API in quite all cases, with a speedup between 30% and 100%. But there are some cases where the _PyUnicodeWriter API is slower: > > Perhaps most of these problems can be solved if instead of the boolean > flag (overallocate/no overallocate) to use the Py_ssize_t parameter that > indicates by how much should you overallocate (it is the length of the > suffix in the format). There is not only a flag (flags.overallocate): there is also the min_length, which is used and helps for str%args and str.format(args). My patch contains a lot of "tricks" to limit overallocation, e.g. don't overallocate if we are writing the last part of the output. Computing exactly the size of the buffer gives the best performance because it avoids a resize in _PyUnicodeWriter_Finish(). I tried for example to modify PyUnicode_Format() to parse the format string twice: first to compute the size of the output buffer, second to write characters. In my experience, parsing the format string twice is more expensive than reallocating the buffer (PyUnicode_READ is expensive), especially on short and simple format strings. I tried different methods to allocate the buffer of _PyUnicodeWriter: change the overallocation factor (0%, 25%, 50%, 100%), only overallocate +100 characters, etc. But I failed to find something better than the proposed patch. At least I can say than always disabling overallocation slows done many cases: when there is a suffix after an argument, or when there are more than one argument. Feel free to experiment other methods to estimate the size of the output buffer.
msg161784 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-28 14:01
So, do you have any comment or complain? Or can I commit the patch? Le 24 mai 2012 11:57, "STINNER Victor" <report@bugs.python.org> a écrit : > > STINNER Victor <victor.stinner@gmail.com> added the comment: > > >> For Python 3.3, _PyUnicodeWriter API is faster than the Py_UCS4 buffer > API and PyAccu API in quite all cases, with a speedup between 30% and 100%. > But there are some cases where the _PyUnicodeWriter API is slower: > > > > Perhaps most of these problems can be solved if instead of the boolean > > flag (overallocate/no overallocate) to use the Py_ssize_t parameter that > > indicates by how much should you overallocate (it is the length of the > > suffix in the format). > > There is not only a flag (flags.overallocate): there is also the > min_length, which is used and helps for str%args and str.format(args). > > My patch contains a lot of "tricks" to limit overallocation, e.g. > don't overallocate if we are writing the last part of the output. > > Computing exactly the size of the buffer gives the best performance > because it avoids a resize in _PyUnicodeWriter_Finish(). I tried for > example to modify PyUnicode_Format() to parse the format string twice: > first to compute the size of the output buffer, second to write > characters. In my experience, parsing the format string twice is more > expensive than reallocating the buffer (PyUnicode_READ is expensive), > especially on short and simple format strings. > > I tried different methods to allocate the buffer of _PyUnicodeWriter: > change the overallocation factor (0%, 25%, 50%, 100%), only > overallocate +100 characters, etc. But I failed to find something > better than the proposed patch. > > At least I can say than always disabling overallocation slows done > many cases: when there is a suffix after an argument, or when there > are more than one argument. > > Feel free to experiment other methods to estimate the size of the output > buffer. > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue14744> > _______________________________________ >
msg161789 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-28 15:22
> So, do you have any comment or complain? Or can I commit the patch? I beg your pardon, I will do a review and additional benchmarks today. So far away I have to say, it is better to use stringlib approach, than the massive macros, which are more difficult to read and edit. However, I will do a benchmark to check if we can achieve the same effect with less change code.
msg161810 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-28 21:44
> So far away I have to say, it is better to use stringlib > approach, than the massive macros, which are more difficult > to read and edit. Ah, you don't like the two macros in longobject.c. Functions to write digits into a string may be appropriate in the stringlib.
msg161813 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-28 22:13
I just sent you a patch which does not use any macros or stringlib.
msg161814 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-28 23:00
> Functions to write digits into a string may be appropriate > in the stringlib. Oh, stringlib is specific to unicodeobject.c: it cannot be used outside.
msg161870 - (view)	Author: Roundup Robot (python-dev)	Date: 2012-05-29 11:13
New changeset 22b56b0b8619 by Victor Stinner in branch 'default': Issue #14744: Use the new _PyUnicodeWriter internal API to speed up str%args and str.format(args) http://hg.python.org/cpython/rev/22b56b0b8619
msg161899 - (view)	Author: Roundup Robot (python-dev)	Date: 2012-05-29 16:52
New changeset 6abab1a103a6 by Victor Stinner in branch 'default': Issue #14744: Fix compilation on Windows http://hg.python.org/cpython/rev/6abab1a103a6
msg161900 - (view)	Author: Roundup Robot (python-dev)	Date: 2012-05-29 16:54
New changeset df0144f68d76 by Victor Stinner in branch 'default': Issue #14744: Fix compilation on Windows (part 2) http://hg.python.org/cpython/rev/df0144f68d76
msg161916 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-29 22:35
report_windows7: Comparaison of str%args and str.format() on Windows 7. * Python 2.7 (64 bits) * Python 3.2 (64 bits), narrow (UTF-16) * Python 3.3 (32 bits), PEP 393 The benchmark is not fair because Python 3.3 is compiled in 32 bits, but there are interesting results. Example: ("{0}"1024).format("A" 4096) takes 1 or 2 seconds with Python 2.7/3.2 and 0.008 sec with Python 3.3!

History
Date	User	Action	Args
2022-04-11 14:57:29	admin	set	github: 58949
2012-05-29 23:15:40	vstinner	set	status: open -> closed resolution: fixed
2012-05-29 22:35:18	vstinner	set	files: + report_windows7 messages: + msg161916
2012-05-29 16:54:02	python-dev	set	messages: + msg161900
2012-05-29 16:52:11	python-dev	set	messages: + msg161899
2012-05-29 11:13:24	python-dev	set	messages: + msg161870
2012-05-28 23:00:02	vstinner	set	messages: + msg161814
2012-05-28 22:13:29	serhiy.storchaka	set	messages: + msg161813
2012-05-28 21:44:37	vstinner	set	messages: + msg161810
2012-05-28 15:22:12	serhiy.storchaka	set	messages: + msg161789
2012-05-28 14:01:35	vstinner	set	messages: + msg161784
2012-05-24 09:57:29	vstinner	set	messages: + msg161494
2012-05-24 07:59:01	serhiy.storchaka	set	messages: + msg161484
2012-05-23 21:50:47	vstinner	set	files: + faster-format.patch messages: + msg161464
2012-05-23 21:48:00	vstinner	set	files: - faa88c50a3d2.diff
2012-05-23 21:38:06	vstinner	set	files: + REPORT_64BIT_PATCH messages: + msg161462
2012-05-23 21:34:59	vstinner	set	messages: + msg161461
2012-05-23 21:26:37	pitrou	set	messages: + msg161460
2012-05-23 21:20:35	vstinner	set	messages: + msg161459
2012-05-23 21:12:47	vstinner	set	files: + faa88c50a3d2.diff
2012-05-23 21:11:05	vstinner	set	files: + REPORT_64BIT_3.3
2012-05-23 21:10:54	vstinner	set	files: + REPORT_64BIT_2.7_3.2_writer
2012-05-23 21:10:42	vstinner	set	files: + REPORT_32BIT_3.3
2012-05-23 21:10:27	vstinner	set	files: + REPORT_32BIT_2.7_3.2_writer
2012-05-14 12:05:13	vstinner	set	hgrepos: + hgrepo125 messages: + msg160610
2012-05-13 21:32:26	pitrou	set	messages: + msg160581
2012-05-13 21:26:53	vstinner	set	messages: + msg160579
2012-05-13 21:18:26	serhiy.storchaka	set	messages: + msg160578
2012-05-13 20:41:06	vstinner	set	messages: + msg160574
2012-05-13 20:15:42	pitrou	set	messages: + msg160568
2012-05-13 20:14:33	serhiy.storchaka	set	messages: + msg160567
2012-05-13 20:10:28	vstinner	set	messages: + msg160565
2012-05-13 09:56:38	pitrou	set	messages: + msg160513
2012-05-13 02:27:04	vstinner	set	files: + benchmark.py
2012-05-13 02:26:48	vstinner	set	files: + dont_overallocate.patch messages: + msg160503
2012-05-09 20:32:53	python-dev	set	nosy: + python-dev messages: + msg160313
2012-05-09 11:53:45	vstinner	set	files: + inline_unicode_writer.patch messages: + msg160277
2012-05-09 08:03:21	serhiy.storchaka	set	files: + issue14744-bench-1.py messages: + msg160259
2012-05-09 05:38:23	pitrou	set	messages: + msg160244
2012-05-08 23:58:41	vstinner	set	files: + format_writer-2.patch messages: + msg160236
2012-05-08 10:38:52	serhiy.storchaka	set	messages: + msg160194
2012-05-08 09:02:13	mark.dickinson	set	nosy: + mark.dickinson messages: + msg160190
2012-05-08 08:37:05	vstinner	set	messages: + msg160189
2012-05-08 05:32:09	serhiy.storchaka	set	messages: + msg160186
2012-05-07 22:19:41	vstinner	set	messages: + msg160177
2012-05-07 22:09:20	vstinner	create