classification
Title: Rewrite StringIO to use the _PyUnicodeWriter API
Type: performance
Stage:
Components: IO, Unicode
Versions: Python 3.4
process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To:
Nosy List: Arfrever, ezio.melotti, pitrou, serhiy.storchaka, vstinner
Priority: normal
Keywords: patch

Created on 2012-08-10 02:30 by vstinner, last changed 2015-03-18 11:04 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
stringio_unicode_writer.patch vstinner, 2012-08-10 02:30 review
bench_stringio.py vstinner, 2012-08-10 02:32
bench_stringio2.py vstinner, 2012-08-11 15:31
Messages (12)
msg167850 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-08-10 02:30
Attached patch rewrites the C implementation of StringIO to use the _PyUnicodeWriter API instead of the PyAccu API. It provides better performance when writing non-ASCII strings.

The patch adds new functions:

 - _PyUnicodeWriter_Truncate()
 - _PyUnicodeWriter_WriteStrAt()
 - _PyUnicodeWriter_GetValue()
msg167851 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-08-10 02:32
Results of my micro benchmark. Use attached bench_stringio.py with benchmark.py:
https://bitbucket.org/haypo/misc/src/tip/python/benchmark.py

Command:
./python benchmark.py script bench_stringio.py

----

Common platform:
CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Python unicode implementation: PEP 393
Platform: Linux-3.4.4-4.fc16.x86_64-x86_64-with-fedora-16-Verne
Bits: int=32, long=64, long long=64, pointer=64
CFLAGS: -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes

Platform of campaign pyaccu:
Date: 2012-08-10 04:24:53
SCM: hg revision=aaa68dce117e tag=tip branch=default date="2012-08-09 21:38 +0200"
Python version: 3.3.0b1 (default:aaa68dce117e, Aug 10 2012, 04:24:19) [GCC 4.6.3 20120306 (Red Hat 4.6.3-2)]

Platform of campaign writer:
Date: 2012-08-10 04:23:21
SCM: hg revision=aaa68dce117e+ tag=tip branch=default date="2012-08-09 21:38 +0200"
Python version: 3.3.0b1 (default:aaa68dce117e+, Aug 10 2012, 04:18:39) [GCC 4.6.3 20120306 (Red Hat 4.6.3-2)]

--------------------------------------+-------------+---------------
Tests                                 |      pyaccu |         writer
--------------------------------------+-------------+---------------
writer ascii                          | 30.4 ms (*) |        30.4 ms
writer reader ascii                   | 37.1 ms (*) |          37 ms
writer latin1                         | 31.5 ms (*) |        30.6 ms
writer reader latin1                  | 38.6 ms (*) |        37.4 ms
writer bmp                            | 31.8 ms (*) |  29.7 ms (-7%)
writer reader bmp                     | 40.8 ms (*) | 36.6 ms (-10%)
writer non-bmp                        | 33.4 ms (*) | 30.2 ms (-10%)
writer reader non-bmp                 | 40.9 ms (*) | 36.7 ms (-10%)
writer long lines ascii               | 7.96 ms (*) |  7.34 ms (-8%)
writer-reader long lines ascii        | 8.16 ms (*) |  7.39 ms (-9%)
writer long lines latin1              | 8.01 ms (*) |   7.4 ms (-8%)
writer-reader long lines latin1       | 8.05 ms (*) |   7.4 ms (-8%)
writer long lines bmp                 |   14 ms (*) | 9.42 ms (-33%)
writer-reader long lines bmp          | 14.2 ms (*) | 9.45 ms (-34%)
writer long lines non-bmp             | 13.9 ms (*) | 9.62 ms (-31%)
writer-reader long lines non-bmp      | 14.3 ms (*) | 9.63 ms (-32%)
writer very long lines ascii          | 7.96 ms (*) |  7.36 ms (-7%)
writer-reader very long lines ascii   | 8.05 ms (*) |  7.37 ms (-8%)
writer very long lines latin1         | 7.98 ms (*) |  7.33 ms (-8%)
writer-reader very long lines latin1  |    8 ms (*) |  7.39 ms (-8%)
writer very long lines bmp            | 14.1 ms (*) | 9.34 ms (-34%)
writer-reader very long lines bmp     | 14.2 ms (*) |  9.4 ms (-34%)
writer very long lines non-bmp        | 13.9 ms (*) |  9.5 ms (-32%)
writer-reader very long lines non-bmp |   14 ms (*) | 9.61 ms (-31%)
reader ascii                          | 6.48 ms (*) |        6.22 ms
reader latin1                         | 6.59 ms (*) |        6.57 ms
reader bmp                            | 7.22 ms (*) |         6.9 ms
reader non-bmp                        | 7.65 ms (*) |        7.31 ms
--------------------------------------+-------------+---------------
Total                                 |  489 ms (*) |  431 ms (-12%)
--------------------------------------+-------------+---------------
msg167857 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-08-10 08:10
> It provides better performance when writing non-ASCII strings.

I would like to know why that is the case. If PyUnicode_Join is not optimal, then perhaps we should better optimize it.

Also, you should post benchmarks with tiny strings as well.
msg167858 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-08-10 08:12
> Also, you should post benchmarks with tiny strings as well.

Oops, sorry, they are already there. Thanks for the numbers.
msg167926 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-08-10 23:12
> I would like to know why that is the case.
> If PyUnicode_Join is not optimal, then perhaps we should
> better optimize it.

I don't know. _PyUnicodeWriter overallocates its buffer (+25%). It may reduce the number of realloc(), and so the number of times that the buffer is copied.
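
To illustrate the overallocation argument, here is a minimal model (not the actual _PyUnicodeWriter code) that counts how often a buffer has to grow when chunks are appended. The function name, the workload and the exact growth rule are assumptions made for this sketch; only the +25% figure comes from the message above.

```python
def count_reallocs(chunk_sizes, overallocate=0.25):
    """Count buffer growths while appending chunks of the given sizes.

    overallocate=0.0 models growing to exactly the needed size each time.
    """
    capacity = 0
    used = 0
    reallocs = 0
    for size in chunk_sizes:
        used += size
        if used > capacity:
            # Grow to the needed size plus a fraction of extra headroom.
            capacity = used + int(used * overallocate)
            reallocs += 1
    return reallocs


chunks = [10] * 1000  # hypothetical workload: 1000 writes of 10 characters
print(count_reallocs(chunks, overallocate=0.0))   # one growth per write
print(count_reallocs(chunks, overallocate=0.25))  # far fewer growths
```

In this model the number of growths becomes roughly logarithmic in the final size once headroom is added, which is the effect described above. It says nothing about how expensive each realloc() is on a given platform.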
msg167927 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-08-10 23:20
> > I would like to know why that is the case.
> > If PyUnicode_Join is not optimal, then perhaps we should
> > better optimize it.
> 
> I don't know. _PyUnicodeWriter overallocates its buffer (+25%). It may
> reduce the number of realloc(), and so the number of times that the
> buffer is copied.

But PyUnicode_Join doesn't realloc() anything, since it creates a buffer
of exactly the right size. So this can't be the answer.
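
A rough Python model of that two-pass strategy, as a sketch only: the first pass measures the total length and the widest character, the second copies each part into a buffer that is allocated once. The name join_exact is hypothetical and the list stands in for the C buffer.

```python
def join_exact(parts):
    """Two-pass join: measure first, then allocate once and copy."""
    # Pass 1: final length and largest code point over all parts.
    total = sum(len(p) for p in parts)
    max_char = max((max(map(ord, p)) for p in parts if p), default=0)

    # Single allocation of exactly the right size; it is never resized.
    buf = [""] * total

    # Pass 2: copy every part into its final position.
    pos = 0
    for p in parts:
        buf[pos:pos + len(p)] = p
        pos += len(p)
    return "".join(buf), max_char


print(join_exact(["ab", "\u20ac", ""]))
```

The cost of this approach is that every part has to be kept alive until the join happens, and each part is visited twice.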
msg167950 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-08-11 10:10
Victor, your benchmark is buggy (it writes one character at a time). You should apply the following patch:

$ diff -u bench_stringio_orig.py bench_stringio.py 
--- bench_stringio_orig.py	2012-08-11 12:02:16.528321958 +0200
+++ bench_stringio.py	2012-08-11 12:05:53.939536902 +0200
@@ -41,8 +41,8 @@
         ('bmp', '\u20ac' * k + '\n'),
         ('non-bmp', '\U0010ffff' * k + '\n'),
     ):
-        bench.bench_func('writer long lines %s' % charset, writer, n // k, text)
-        bench.bench_func('writer-reader long lines %s' % charset, writer_reader, n // k, text)
+        bench.bench_func('writer long lines %s' % charset, writer, n, [text])
+        bench.bench_func('writer-reader long lines %s' % charset, writer_reader, n, [text])
 
     for charset, text in (
         ('ascii', 'a' * (n // 10) + '\n'),
@@ -50,8 +50,8 @@
         ('bmp', '\u20ac' * (n // 10) + '\n'),
         ('non-bmp', '\U0010ffff' * (n // 10) + '\n'),
     ):
-        bench.bench_func('writer very long lines %s' % charset, writer, 10, text)
-        bench.bench_func('writer-reader very long lines %s' % charset, writer_reader, 10, text)
+        bench.bench_func('writer very long lines %s' % charset, writer, 100, [text])
+        bench.bench_func('writer-reader very long lines %s' % charset, writer_reader, 100, [text])
 
     data = 'abc\n' * n
     bench.bench_func('reader ascii', reader, data)
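
Why passing text instead of [text] matters can be shown with a small stand-in. The writer() below is a hypothetical reconstruction based only on the call sites in the diff above, not the function from bench_stringio.py: it loops over its second argument, so a bare str is consumed one character at a time.

```python
import io


def writer(n, lines):
    """Hypothetical stand-in: write every item of `lines`, n times over."""
    stream = io.StringIO()
    write = stream.write
    calls = 0
    for _ in range(n):
        for line in lines:
            write(line)
            calls += 1
    return calls, stream.getvalue()


text = "a" * 5 + "\n"
calls_str, value_str = writer(2, text)      # str: iterated character by character
calls_list, value_list = writer(2, [text])  # list: the whole line in one write()

print(calls_str, calls_list)
print(value_str == value_list)
```

Both calls produce the same stream content, so the mistake is invisible in the output and only shows up in the number of write() calls being timed.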
msg167974 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-08-11 15:31
> Victor, your benchmark is buggy (it writes one character at a time).

Oh, it's not what I wanted to test.

I attach a new benchmark. Here are the results. PyAccu looks much more appropriate than _PyUnicodeWriter, because it is always faster, except to write 100.000 very long lines.

Common platform:
CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
CFLAGS: -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
Bits: int=32, long=64, long long=64, pointer=64
Python unicode implementation: PEP 393
Platform: Linux-3.4.4-4.fc16.x86_64-x86_64-with-fedora-16-Verne

Platform of campaign pyaccu:
SCM: hg revision=9804aec74d4a tag=tip branch=default date="2012-08-10 18:55 -0400"
Date: 2012-08-11 16:53:46
Python version: 3.3.0b1 (default:9804aec74d4a, Aug 11 2012, 16:53:12) [GCC 4.6.3 20120306 (Red Hat 4.6.3-2)]

Platform of campaign writer:
SCM: hg revision=9804aec74d4a+ tag=tip branch=default date="2012-08-10 18:55 -0400"
Date: 2012-08-11 16:50:40
Python version: 3.3.0b1 (default:9804aec74d4a+, Aug 11 2012, 16:33:18) [GCC 4.6.3 20120306 (Red Hat 4.6.3-2)]

--------------------------------------+-------------+---------------
10 lines                              |      pyaccu |         writer
--------------------------------------+-------------+---------------
reader short line ascii               | 1.53 us (*) |        1.46 us
writer short line ascii               | 4.85 us (*) |  4.48 us (-8%)
writer-reader short line ascii        | 6.45 us (*) | 5.71 us (-12%)
reader short line latin1              | 1.57 us (*) |  1.45 us (-8%)
writer short line latin1              | 4.92 us (*) |  4.56 us (-7%)
writer-reader short line latin1       |  6.6 us (*) | 5.78 us (-13%)
reader short line bmp                 | 1.64 us (*) |  1.54 us (-6%)
writer short line bmp                 | 5.01 us (*) | 4.43 us (-12%)
writer-reader short line bmp          | 6.68 us (*) | 5.71 us (-14%)
reader short line non-bmp             | 1.61 us (*) |        1.59 us
writer short line non-bmp             |  5.1 us (*) | 4.55 us (-11%)
writer-reader short line non-bmp      | 6.74 us (*) | 5.66 us (-16%)
reader long lines ascii               |  103 us (*) | 33.4 us (-68%)
writer long lines ascii               |  998 ns (*) |  836 ns (-16%)
writer-reader long lines ascii        | 1.45 us (*) | 1.18 us (-19%)
reader long lines latin1              |  105 us (*) | 34.2 us (-67%)
writer long lines latin1              |  997 ns (*) |  831 ns (-17%)
writer-reader long lines latin1       | 1.47 us (*) |  1.2 us (-18%)
reader long lines bmp                 |  121 us (*) | 85.9 us (-29%)
writer long lines bmp                 |  995 ns (*) |  861 ns (-13%)
writer-reader long lines bmp          | 1.43 us (*) | 1.13 us (-21%)
reader long lines non-bmp             | 97.1 us (*) |        99.7 us
writer long lines non-bmp             |    1 us (*) |  819 ns (-18%)
writer-reader long lines non-bmp      |  1.4 us (*) | 1.18 us (-16%)
reader very long lines ascii          | 1.42 us (*) |        1.45 us
writer very long lines ascii          | 3.04 us (*) |  2.88 us (-5%)
writer-reader very long lines ascii   | 4.59 us (*) | 4.12 us (-10%)
reader very long lines latin1         | 1.57 us (*) |  1.47 us (-7%)
writer very long lines latin1         | 3.04 us (*) | 2.73 us (-10%)
writer-reader very long lines latin1  | 4.66 us (*) | 4.04 us (-13%)
reader very long lines bmp            | 1.55 us (*) |        1.55 us
writer very long lines bmp            | 3.03 us (*) |        2.91 us
writer-reader very long lines bmp     | 4.72 us (*) | 4.08 us (-14%)
reader very long lines non-bmp        | 1.55 us (*) |        1.49 us
writer very long lines non-bmp        | 3.09 us (*) |  2.93 us (-5%)
writer-reader very long lines non-bmp | 4.59 us (*) | 4.06 us (-12%)
--------------------------------------+-------------+---------------
Total                                 |  525 us (*) |  342 us (-35%)
--------------------------------------+-------------+---------------

--------------------------------------+-------------+---------------
1000 lines                            |      pyaccu |         writer
--------------------------------------+-------------+---------------
reader short line ascii               | 68.2 us (*) |        66.1 us
writer short line ascii               |  308 us (*) |         307 us
writer-reader short line ascii        |  378 us (*) |         374 us
reader short line latin1              |   72 us (*) |        68.5 us
writer short line latin1              |  324 us (*) |         313 us
writer-reader short line latin1       |  395 us (*) |         383 us
reader short line bmp                 | 74.8 us (*) |        71.9 us
writer short line bmp                 |  326 us (*) |   303 us (-7%)
writer-reader short line bmp          |  397 us (*) |         378 us
reader short line non-bmp             | 72.9 us (*) |        72.6 us
writer short line non-bmp             |  329 us (*) |   304 us (-8%)
writer-reader short line non-bmp      |  397 us (*) |         383 us
reader long lines ascii               |  104 us (*) | 33.8 us (-67%)
writer long lines ascii               | 1.99 us (*) | 2.52 us (+27%)
writer-reader long lines ascii        | 4.37 us (*) | 3.45 us (-21%)
reader long lines latin1              |  104 us (*) | 33.3 us (-68%)
writer long lines latin1              | 2.07 us (*) | 2.55 us (+23%)
writer-reader long lines latin1       | 4.51 us (*) | 3.57 us (-21%)
reader long lines bmp                 |  120 us (*) | 80.5 us (-33%)
writer long lines bmp                 | 2.15 us (*) | 2.55 us (+18%)
writer-reader long lines bmp          | 4.71 us (*) | 3.86 us (-18%)
reader long lines non-bmp             | 90.6 us (*) |  97.6 us (+8%)
writer long lines non-bmp             | 2.18 us (*) | 2.68 us (+23%)
writer-reader long lines non-bmp      | 4.24 us (*) |        4.05 us
reader very long lines ascii          | 2.53 us (*) | 1.66 us (-34%)
writer very long lines ascii          | 3.07 us (*) | 3.46 us (+13%)
writer-reader very long lines ascii   | 6.18 us (*) | 4.89 us (-21%)
reader very long lines latin1         | 2.57 us (*) | 1.75 us (-32%)
writer very long lines latin1         | 3.16 us (*) | 3.46 us (+10%)
writer-reader very long lines latin1  | 6.32 us (*) | 4.98 us (-21%)
reader very long lines bmp            |  2.7 us (*) | 2.34 us (-14%)
writer very long lines bmp            | 3.52 us (*) |        3.65 us
writer-reader very long lines bmp     | 6.73 us (*) |  5.7 us (-15%)
reader very long lines non-bmp        | 2.45 us (*) |        2.35 us
writer very long lines non-bmp        | 3.47 us (*) | 3.87 us (+12%)
writer-reader very long lines non-bmp | 5.98 us (*) |        5.85 us
--------------------------------------+-------------+---------------
Total                                 | 3.63 ms (*) |  3.34 ms (-8%)
--------------------------------------+-------------+---------------

--------------------------------------+-------------+---------------
100000 lines                          |      pyaccu |         writer
--------------------------------------+-------------+---------------
reader short line ascii               | 6.74 ms (*) |        6.43 ms
writer short line ascii               | 30.7 ms (*) |        29.8 ms
writer-reader short line ascii        | 37.5 ms (*) |        36.6 ms
reader short line latin1              | 7.08 ms (*) |  6.64 ms (-6%)
writer short line latin1              | 31.3 ms (*) |        30.1 ms
writer-reader short line latin1       | 38.8 ms (*) |        37.5 ms
reader short line bmp                 | 7.46 ms (*) |  6.98 ms (-6%)
writer short line bmp                 |   32 ms (*) |    29 ms (-9%)
writer-reader short line bmp          | 40.5 ms (*) | 35.9 ms (-11%)
reader short line non-bmp             | 7.36 ms (*) |        7.23 ms
writer short line non-bmp             | 33.3 ms (*) | 29.4 ms (-12%)
writer-reader short line non-bmp      | 40.5 ms (*) | 36.5 ms (-10%)
reader long lines ascii               |  103 us (*) | 32.6 us (-68%)
writer long lines ascii               | 59.4 us (*) | 66.5 us (+12%)
writer-reader long lines ascii        |  220 us (*) | 99.2 us (-55%)
reader long lines latin1              |  105 us (*) | 32.2 us (-69%)
writer long lines latin1              | 60.2 us (*) | 67.3 us (+12%)
writer-reader long lines latin1       |  240 us (*) | 97.6 us (-59%)
reader long lines bmp                 |  122 us (*) | 76.9 us (-37%)
writer long lines bmp                 | 62.1 us (*) | 73.8 us (+19%)
writer-reader long lines bmp          |  242 us (*) |  151 us (-38%)
reader long lines non-bmp             | 95.7 us (*) |        92.1 us
writer long lines non-bmp             | 76.5 us (*) | 90.3 us (+18%)
writer-reader long lines non-bmp      |  198 us (*) |  173 us (-12%)
reader very long lines ascii          | 91.6 us (*) | 11.5 us (-87%)
writer very long lines ascii          | 7.15 us (*) | 11.9 us (+67%)
writer-reader very long lines ascii   |  145 us (*) | 20.1 us (-86%)
reader very long lines latin1         |  110 us (*) |   12 us (-89%)
writer very long lines latin1         | 7.52 us (*) | 12.1 us (+61%)
writer-reader very long lines latin1  |  165 us (*) | 20.7 us (-87%)
reader very long lines bmp            | 91.1 us (*) | 46.7 us (-49%)
writer very long lines bmp            | 12.3 us (*) | 22.5 us (+82%)
writer-reader very long lines bmp     |  150 us (*) | 61.9 us (-59%)
reader very long lines non-bmp        | 66.8 us (*) |        66.6 us
writer very long lines non-bmp        | 22.4 us (*) | 38.4 us (+72%)
writer-reader very long lines non-bmp |  108 us (*) | 87.7 us (-19%)
--------------------------------------+-------------+---------------
Total                                 |  316 ms (*) |   294 ms (-7%)
--------------------------------------+-------------+---------------

-------------+-------------+--------------
Summary      |      pyaccu |        writer
-------------+-------------+--------------
10 lines     |  525 us (*) | 342 us (-35%)
1000 lines   | 3.63 ms (*) | 3.34 ms (-8%)
100000 lines |  316 ms (*) |  294 ms (-7%)
-------------+-------------+--------------
Total        |  320 ms (*) |  297 ms (-7%)
-------------+-------------+--------------
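
The percentages in these tables appear to be the relative change of the writer column against the pyaccu baseline (the column marked with (*)). A minimal sketch of that arithmetic, using only the rounded values from the Summary table; relative_change is a name made up for this illustration.

```python
def relative_change(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100


# Rounded per-campaign totals from the Summary table, converted to ms.
pyaccu_ms = 0.525 + 3.63 + 316
writer_ms = 0.342 + 3.34 + 294

print(f"pyaccu total: {pyaccu_ms:.1f} ms")
print(f"writer total: {writer_ms:.1f} ms")
print(f"change: {relative_change(pyaccu_ms, writer_ms):.0f}%")
```

Because the inputs are already rounded for display, the recomputed totals can differ from the table in the last digit.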
msg167975 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-08-11 15:35
"PyAccu looks much more appropriate than _PyUnicodeWriter, because it is always faster, except to write 100.000 very long lines."

Oh... I added colors to my tool, but there was a bug: I used the wrong colors... It's just the opposite.

_PyUnicodeWriter is almost always faster, except when writing more than 100,000 very long lines.
msg167977 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-08-11 16:19
> _PyUnicodeWriter is almost always faster

Actually, PyAccu is consistently faster for the "writer" case, while _PyUnicodeWriter is faster for the "writer-reader" case.
This is not because of PyAccu, but because of the way StringIO uses it: when e.g. readline() is called, the PyAccu result is converted into a Py_UCS4* buffer, then each readline() result is converted again by finding the max char in the sub-buffer.

So I would suggest using PyAccu, but converting its result to a _PyUnicodeWriter rather than a Py_UCS4* buffer.
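
The per-line conversion cost described above can be modelled in Python. This is only a sketch of the idea, not the StringIO C code: a list of code points stands in for the 4-byte-per-character buffer, and each line cut from it is rescanned for its largest character to pick the PEP 393 storage width. Both function names are hypothetical.

```python
def storage_kind(text):
    """Bytes per character that PEP 393 storage would need (1, 2 or 4)."""
    max_char = max(map(ord, text), default=0)  # O(len(text)) scan
    if max_char < 0x100:
        return 1
    if max_char < 0x10000:
        return 2
    return 4


def readlines_from_ucs4(buffer):
    """Split a list of code points into lines, rescanning each line."""
    lines, start = [], 0
    for i, code_point in enumerate(buffer):
        if code_point == 0x0A:  # newline ends the current line
            line = "".join(map(chr, buffer[start:i + 1]))
            lines.append((line, storage_kind(line)))
            start = i + 1
    if start < len(buffer):  # trailing text without a newline
        line = "".join(map(chr, buffer[start:]))
        lines.append((line, storage_kind(line)))
    return lines


data = [ord(c) for c in "ab\n\u20ac\n\U0010ffff"]
for line, kind in readlines_from_ucs4(data):
    print(repr(line), kind)
```

Every line is therefore touched twice on the read side, once when it is copied out and once when it is scanned, which fits the slower reader numbers for the pyaccu campaign.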
msg167978 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-08-11 16:45
See the benchmark results in issue15381 (the patch is not applicable to StringIO). These numbers show that the resize strategy can be much slower than the append/join strategy on Windows.
msg238415 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-03-18 11:04
I'm no longer interested in working on this issue, and it's not clear that _PyUnicodeWriter is always faster. Switching from a list to _PyUnicodeWriter on a specific event would make the code much more complex. I prefer to just close the issue.
History
Date User Action Args
2015-03-18 11:04:56 vstinner set status: open -> closed; resolution: out of date; messages: + msg238415
2012-09-25 00:18:34 Arfrever set nosy: + Arfrever
2012-08-11 16:45:34 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg167978
2012-08-11 16:19:35 pitrou set messages: + msg167977
2012-08-11 15:35:26 vstinner set messages: + msg167975
2012-08-11 15:31:19 vstinner set files: + bench_stringio2.py; messages: + msg167974
2012-08-11 10:10:55 pitrou set messages: + msg167950
2012-08-10 23:20:23 pitrou set messages: + msg167927
2012-08-10 23:12:28 vstinner set messages: + msg167926
2012-08-10 08:12:08 pitrou set messages: + msg167858
2012-08-10 08:10:01 pitrou set messages: + msg167857
2012-08-10 02:33:33 vstinner set title: Rewriter StringIO to use the _PyUnicodeWriter API -> Rewrite StringIO to use the _PyUnicodeWriter API
2012-08-10 02:33:22 vstinner set type: performance
2012-08-10 02:32:19 vstinner set files: + bench_stringio.py; messages: + msg167851
2012-08-10 02:30:29 vstinner create