
classification
Title: Use PyUnicodeWriter in repr(dict)
Type: enhancement
Components: Unicode
Versions: Python 3.4

process
Status: closed
Resolution: fixed
Nosy List: ezio.melotti, python-dev, serhiy.storchaka, vstinner
Priority: normal
Keywords: patch

Created on 2013-11-18 21:15 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name              | Uploaded
dict_repr_writer.patch | vstinner, 2013-11-18 21:15
bench_dict_repr.py     | vstinner, 2013-11-18 21:15
Messages (5)
msg203322 - Author: STINNER Victor (vstinner) (Python committer) Date: 2013-11-18 21:15
The attached patch modifies the dict_repr() function to use the _PyUnicodeWriter API instead of building a list of short strings with PyUnicode_AppendAndDel() and then calling PyUnicode_Join() to join the list. PyUnicode_Append() is inefficient because it has to allocate a new string instead of reusing the same buffer.

The _PyUnicodeWriter API has a different design: it overallocates a buffer, writes Unicode characters into it, and shrinks the buffer at the end. It is faster according to my micro-benchmark.
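
The overallocate-then-shrink idea can be sketched in pure Python. This is only a toy model and not the C implementation: the class name `OverallocatingWriter`, the helper `dict_repr_sketch()`, and the 25% growth factor are all assumptions made for illustration, and the sketch ignores recursive containers.

```python
class OverallocatingWriter:
    """Toy model of a writer that grows its buffer past what is needed."""

    def __init__(self):
        self._buf = []           # character slots; len(self._buf) is the capacity
        self._pos = 0            # number of characters actually written
        self.reallocations = 0   # how many times the buffer had to grow

    def write(self, text):
        needed = self._pos + len(text)
        if needed > len(self._buf):
            # Overallocate (growth factor is an assumption) so that the
            # next few writes fit without growing the buffer again.
            new_capacity = needed + needed // 4
            self._buf.extend([""] * (new_capacity - len(self._buf)))
            self.reallocations += 1
        # Same-length slice assignment: overwrites slots, keeps the capacity.
        self._buf[self._pos:needed] = text
        self._pos = needed

    def finish(self):
        # "Shrink" to the written length and build the final string.
        return "".join(self._buf[:self._pos])


def dict_repr_sketch(d):
    """Build a repr-like string for a flat dict using a single writer."""
    writer = OverallocatingWriter()
    writer.write("{")
    first = True
    for key, value in d.items():
        if not first:
            writer.write(", ")
        first = False
        writer.write(repr(key))
        writer.write(": ")
        writer.write(repr(value))
    writer.write("}")
    return writer.finish(), writer.reallocations
```

For a dict with a thousand entries the sketch performs several thousand writes but grows the buffer only a few dozen times, which is the property the overallocation is meant to provide.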


$ ./python ~/prog/HG/misc/python/benchmark.py compare_to pyaccu writer
Common platform:
CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Python unicode implementation: PEP 393
CFLAGS: -Wno-unused-result -Werror=declaration-after-statement -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
Timer precision: 40 ns
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
Platform: Linux-3.9.4-200.fc18.x86_64-x86_64-with-fedora-18-Spherical_Cow
Bits: int=32, long=64, long long=64, size_t=64, void*=64
Timer: time.perf_counter

Platform of campaign pyaccu:
Date: 2013-11-18 21:37:44
Python version: 3.4.0a4+ (default:fc7ceb001eec, Nov 18 2013, 21:29:41) [GCC 4.7.2 20121109 (Red Hat 4.7.2-8)]
SCM: hg revision=fc7ceb001eec tag=tip branch=default date="2013-11-18 21:11 +0100"

Platform of campaign writer:
Date: 2013-11-18 22:10:40
Python version: 3.4.0a4+ (default:fc7ceb001eec+, Nov 18 2013, 22:10:12) [GCC 4.7.2 20121109 (Red Hat 4.7.2-8)]
SCM: hg revision=fc7ceb001eec+ tag=tip branch=default date="2013-11-18 21:11 +0100"

--------------------------------------+-------------+--------------
Tests                                 |      pyaccu |        writer
--------------------------------------+-------------+--------------
{"a": 1}                              |  603 ns (*) | 496 ns (-18%)
dict(zip("abc", range(3)))            | 1.05 us (*) | 904 ns (-14%)
{"%03d":"abc" for k in range(10)}     |  631 ns (*) | 501 ns (-21%)
{"%100d":"abc" for k in range(10)}    |  660 ns (*) | 484 ns (-27%)
{k:"a" for k in range(10**3)}         |  235 us (*) | 166 us (-30%)
{k:"abc" for k in range(10**3)}       |  245 us (*) | 177 us (-28%)
{"%100d":"abc" for k in range(10**3)} |  668 ns (*) | 478 ns (-28%)
{k:"a" for k in range(10**6)}         |  258 ms (*) | 186 ms (-28%)
{k:"abc" for k in range(10**6)}       |  265 ms (*) | 184 ms (-31%)
{"%100d":"abc" for k in range(10**6)} |  652 ns (*) | 489 ns (-25%)
--------------------------------------+-------------+--------------
Total                                 |  523 ms (*) | 369 ms (-29%)
--------------------------------------+-------------+--------------
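
A comparable measurement can be approximated with the standard `timeit` module. This is a minimal sketch and not the attached bench_dict_repr.py, whose contents are not shown here: it assumes the benchmark times `repr(d)` on a dict built from each expression in the "Tests" column, and the names `SETUPS`, `bench_repr()` and `run()` are hypothetical. Only the smaller cases are included to keep the run short.

```python
import timeit

# Expressions copied from the "Tests" column (smaller cases only).
SETUPS = [
    '{"a": 1}',
    'dict(zip("abc", range(3)))',
    '{"%03d":"abc" for k in range(10)}',
    '{k:"a" for k in range(10**3)}',
]


def bench_repr(expr, repeat=5, number=1000):
    """Return the best per-call time of repr(d), in seconds, for d = eval(expr)."""
    d = eval(expr)  # build the dict once, outside the timed statement
    timer = timeit.Timer("repr(d)", globals={"d": d})
    return min(timer.repeat(repeat=repeat, number=number)) / number


def run():
    """Time every expression in SETUPS and return {expression: seconds}."""
    return {expr: bench_repr(expr) for expr in SETUPS}


if __name__ == "__main__":
    for expr, seconds in run().items():
        print(f"{expr:<40} {seconds * 1e9:10.0f} ns")
```

As written in the table, the keys "%03d" and "%100d" are constant strings (there is no `% k`), so those expressions build a one-entry dict whatever the range size. That is consistent with their timings staying in the same range as `{"a": 1}`.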
msg203367 - Author: Roundup Robot (python-dev) (Python triager) Date: 2013-11-19 12:11
New changeset 3a354b879d1f by Victor Stinner in branch 'default':
Issue #19646: repr(dict) now uses _PyUnicodeWriter API for better performances
http://hg.python.org/cpython/rev/3a354b879d1f
msg203368 - Author: STINNER Victor (vstinner) (Python committer) Date: 2013-11-19 12:13
I added a new _PyUnicodeWriter_WriteASCIIString() function in response to Serhiy's comment on Rietveld:
"Perhaps it will be worth to add a helper function or macros  _PyUnicodeWriter_WriteTwoAsciiChars()?"

changeset:   87263:d1ca05428c38
user:        Victor Stinner <victor.stinner@gmail.com>
date:        Tue Nov 19 12:54:53 2013 +0100
files:       Include/unicodeobject.h Objects/listobject.c Objects/unicodeobject.c Python/formatter_unicode.c
description:
Add _PyUnicodeWriter_WriteASCIIString() function

With this function, there is no need to create temporary colon (": ") or separator (", ") strings, and performance is a little better with the final commit.

Common platform:
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
CFLAGS: -Wno-unused-result -Werror=declaration-after-statement -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
Timer: time.perf_counter
Platform: Linux-3.9.4-200.fc18.x86_64-x86_64-with-fedora-18-Spherical_Cow
Python unicode implementation: PEP 393
CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Bits: int=32, long=64, long long=64, size_t=64, void*=64

Platform of campaign pyaccu:
Python version: 3.4.0a4+ (default:99141ab08e21, Nov 19 2013, 13:10:27) [GCC 4.7.2 20121109 (Red Hat 4.7.2-8)]
Timer precision: 39 ns
Date: 2013-11-19 13:10:28
SCM: hg revision=99141ab08e21 branch=default date="2013-11-19 12:59 +0100"

Platform of campaign writer:
Python version: 3.4.0a4+ (default:3a354b879d1f, Nov 19 2013, 13:08:42) [GCC 4.7.2 20121109 (Red Hat 4.7.2-8)]
Timer precision: 46 ns
Date: 2013-11-19 13:09:20
SCM: hg revision=3a354b879d1f tag=tip branch=default date="2013-11-19 13:07 +0100"

--------------------------------------+-------------+--------------
Tests                                 |      pyaccu |        writer
--------------------------------------+-------------+--------------
{"a": 1}                              |  613 ns (*) | 338 ns (-45%)
dict(zip("abc", range(3)))            | 1.05 us (*) | 640 ns (-39%)
{"%03d":"abc" for k in range(10)}     |  635 ns (*) | 447 ns (-30%)
{"%100d":"abc" for k in range(10)}    |  651 ns (*) | 424 ns (-35%)
{k:"a" for k in range(10**3)}         |  233 us (*) | 132 us (-44%)
{k:"abc" for k in range(10**3)}       |  251 us (*) | 154 us (-39%)
{"%100d":"abc" for k in range(10**3)} |  668 ns (*) | 412 ns (-38%)
{k:"a" for k in range(10**6)}         |  268 ms (*) | 158 ms (-41%)
{k:"abc" for k in range(10**6)}       |  276 ms (*) | 163 ms (-41%)
{"%100d":"abc" for k in range(10**6)} |  658 ns (*) | 422 ns (-36%)
--------------------------------------+-------------+--------------
Total                                 |  544 ms (*) | 321 ms (-41%)
--------------------------------------+-------------+--------------
msg203372 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2013-11-19 13:18
> With this function, there is no need to create temporary colon (": ") or separator (", ") strings, and performance is a little better with the final commit.

I'm surprised that this had such a large effect. ;) I was only hoping for clearer code.
msg203373 - Author: STINNER Victor (vstinner) (Python committer) Date: 2013-11-19 13:21
> I'm surprised that this had such a large effect. ;) I was only hoping for clearer code.

To be honest, I expected shorter code but worse performance with _PyUnicodeWriter_WriteASCIIString().

dict_repr() was not especially fast: on every call it called PyUnicode_FromString() to decode ": " and ", " from UTF-8. list_repr() and tuplerepr() kept the ", " separator cached in a static variable. This is probably why the code is now faster.
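
The difference between decoding the separators on every call and keeping them cached can be illustrated with a rough Python analogy. It mirrors the structure described above rather than the C code itself; the function names and the module-level constants are hypothetical, and in pure Python the cost difference is expected to be far smaller than in C.

```python
SEP_BYTES = b", "
COLON_BYTES = b": "


def repr_decoding_each_call(d):
    """Decode both separators from UTF-8 on every call (the costlier pattern)."""
    sep = SEP_BYTES.decode("utf-8")
    colon = COLON_BYTES.decode("utf-8")
    return "{" + sep.join(repr(k) + colon + repr(v) for k, v in d.items()) + "}"


# Created once at import time, playing the role of a cached static variable.
_SEP = ", "
_COLON = ": "


def repr_cached_separators(d):
    """Reuse separators that were created once, outside the function."""
    return "{" + _SEP.join(repr(k) + _COLON + repr(v) for k, v in d.items()) + "}"
```

Both functions produce the same text for a flat dict; only the per-call setup work differs.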
History
Date                | User             | Action | Args
2022-04-11 14:57:53 | admin            | set    | github: 63845
2013-11-19 13:21:33 | vstinner         | set    | messages: + msg203373
2013-11-19 13:18:33 | serhiy.storchaka | set    | messages: + msg203372
2013-11-19 12:13:58 | vstinner         | set    | status: open -> closed; resolution: fixed
2013-11-19 12:13:52 | vstinner         | set    | messages: + msg203368
2013-11-19 12:11:31 | python-dev       | set    | nosy: + python-dev; messages: + msg203367
2013-11-18 21:15:14 | vstinner         | set    | files: + bench_dict_repr.py
2013-11-18 21:15:06 | vstinner         | create |