Rewrite StringIO to use the _PyUnicodeWriter API #59817
Comments
Attached patch rewrites the C implementation of StringIO to use the _PyUnicodeWriter API instead of the PyAccu API. It provides better performance when writing non-ASCII strings. The patch adds new functions:
Results of my micro-benchmark. Use the attached bench_stringio.py with benchmark.py.
[Results table comparing the "pyaccu" and "writer" campaigns; command line and platform details omitted.]
I would like to know why that is the case. If PyUnicode_Join is not optimal, then perhaps we should optimize it instead. Also, you should post benchmarks with tiny strings as well.
Oops, sorry, they are already there. Thanks for the numbers.
I don't know. _PyUnicodeWriter overallocates its buffer (+25%). This may reduce the number of realloc() calls, and so the number of times the buffer is copied.
But PyUnicode_Join doesn't realloc() anything, since it creates a buffer
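The trade-off in this exchange can be illustrated with a toy model. The sketch below is not CPython code: it only counts how often a growing buffer would need to be enlarged under a proportional overallocation policy, compared with a join-style strategy that sums the chunk sizes first and allocates once. The function names and the growth formula are assumptions made for illustration, and a real realloc() does not always copy.

```python
def count_reallocs(chunk_sizes, overallocate=0.25):
    """Count buffer growths when appending chunks one by one.

    Toy model only: each growth requests the needed length plus a
    fraction of headroom (0.25 mirrors the +25% mentioned above).
    """
    capacity = 0
    length = 0
    reallocs = 0
    for size in chunk_sizes:
        length += size
        if length > capacity:
            # Grow to the needed length plus proportional headroom.
            capacity = length + int(length * overallocate)
            reallocs += 1
    return reallocs


def join_allocs(chunk_sizes):
    """A join-style strategy knows the total size up front: one allocation."""
    return 1 if chunk_sizes else 0


chunks = [10] * 1000                                  # 1000 writes of 10 characters
exact = count_reallocs(chunks, overallocate=0.0)      # grow on every write
padded = count_reallocs(chunks, overallocate=0.25)    # grow occasionally
single = join_allocs(chunks)                          # allocate once at the end
print(exact, padded, single)
```

In this model, overallocation sharply reduces the number of growths, but the join strategy still performs a single allocation, which is the point raised in the reply above.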
Victor, your benchmark is buggy (it writes one character at a time). You should apply the following patch:
$ diff -u bench_stringio_orig.py bench_stringio.py
--- bench_stringio_orig.py 2012-08-11 12:02:16.528321958 +0200
+++ bench_stringio.py 2012-08-11 12:05:53.939536902 +0200
@@ -41,8 +41,8 @@
('bmp', '\u20ac' * k + '\n'),
('non-bmp', '\U0010ffff' * k + '\n'),
):
- bench.bench_func('writer long lines %s' % charset, writer, n // k, text)
- bench.bench_func('writer-reader long lines %s' % charset, writer_reader, n // k, text)
+ bench.bench_func('writer long lines %s' % charset, writer, n, [text])
+ bench.bench_func('writer-reader long lines %s' % charset, writer_reader, n, [text])
for charset, text in (
('ascii', 'a' * (n // 10) + '\n'),
@@ -50,8 +50,8 @@
('bmp', '\u20ac' * (n // 10) + '\n'),
('non-bmp', '\U0010ffff' * (n // 10) + '\n'),
):
- bench.bench_func('writer very long lines %s' % charset, writer, 10, text)
- bench.bench_func('writer-reader very long lines %s' % charset, writer_reader, 10, text)
+ bench.bench_func('writer very long lines %s' % charset, writer, 100, [text])
+ bench.bench_func('writer-reader very long lines %s' % charset, writer_reader, 100, [text])
data = 'abc\n' * n
bench.bench_func('reader ascii', reader, data)
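To see why passing `text` instead of `[text]` makes the benchmark write one character at a time, consider the sketch below. The `writer` function here is a hypothetical stand-in (the real one is in the attached bench_stringio.py, which is not shown); it assumes the benchmark iterates over its last argument and calls `write()` for each item.

```python
import io


def writer(n, lines):
    # Hypothetical stand-in for the benchmark's writer(): it assumes the
    # real function loops over `lines` and writes each item.
    out = io.StringIO()
    write_calls = 0
    for _ in range(n):
        for line in lines:
            out.write(line)
            write_calls += 1
    return write_calls, out.getvalue()


text = "a" * 5 + "\n"

# Passing a str: iterating over it yields single characters.
calls_str, value_str = writer(2, text)

# Passing a one-element list: each write() receives the whole line.
calls_list, value_list = writer(2, [text])

print(calls_str, calls_list, value_str == value_list)
```

Both calls produce the same final string, so the mistake is invisible in the output and only shows up in the number and size of the `write()` calls being timed.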
Oh, it's not what I wanted to test. I attach a new benchmark. Here are the results. PyAccu looks much more appropriate than _PyUnicodeWriter, because it is always faster, except to write 100.000 very long lines.
[Results tables comparing the "pyaccu" and "writer" campaigns; platform details omitted.]
"PyAccu looks much more appropriate than _PyUnicodeWriter, because it is always faster, except to write 100.000 very long lines." Oh... I added colors to my tool, but there was a bug: I used the wrong colors... It's just the opposite. _PyUnicodeWriter is almost always faster, except to write more than 100.000 very long lines.
Actually, PyAccu is consistently faster for the "writer" case, while _PyUnicodeWriter is faster for the "writer-reader" case. So I would suggest using PyAccu, but converting its result to a _PyUnicodeWriter rather than a PyUCS4* buffer.
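A rough Python model of this hybrid idea is sketched below. It is not the C implementation and does not use PyAccu or _PyUnicodeWriter; the class name and its behavior are assumptions made purely for illustration. It accumulates chunks in a list while only writes occur and materializes a single buffer the first time the contents are read. It handles appends only (no seeking or overwriting).

```python
class HybridStringBuffer:
    """Toy model: list accumulation while writing, one buffer once read."""

    def __init__(self):
        self._chunks = []     # write-only phase: cheap appends
        self._buffer = None   # materialized phase: a single string

    def write(self, s):
        if self._buffer is not None:
            # A write after a read: go back to accumulating, seeded with
            # the already materialized contents.
            self._chunks = [self._buffer]
            self._buffer = None
        self._chunks.append(s)
        return len(s)

    def getvalue(self):
        if self._buffer is None:
            # The "specific event": the first read joins all chunks once.
            self._buffer = "".join(self._chunks)
            self._chunks = []
        return self._buffer


buf = HybridStringBuffer()
buf.write("ab")
buf.write("cd")
first = buf.getvalue()
buf.write("ef")
second = buf.getvalue()
print(first, second)
```

Even in this simplified form, every method has to check which phase the object is in, which hints at the extra state handling a real implementation with seeking and overwriting would need.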
See benchmark results in bpo-15381 (the patch is not applicable to StringIO). These numbers show that the resize strategy can be much slower than the append/join strategy on Windows.
I'm no longer interested in working on this issue, and it's not clear that _PyUnicodeWriter is always faster. Switching from a list to _PyUnicodeWriter on a specific event would make the code much more complex. I prefer to just close the issue.