classification
Title: Add _PyBytesWriter API
Type: enhancement Stage: patch review
Components: Interpreter Core Versions: Python 3.4
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: pitrou, r.david.murray, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2013-04-15 22:03 by vstinner, last changed 2013-05-16 21:26 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
bytes_writer.patch vstinner, 2013-04-15 22:03 review
bytes_writer_cjkcodecs.patch vstinner, 2013-04-15 22:03 review
bench_encoders.compare vstinner, 2013-04-18 01:08
bench_encoders.py vstinner, 2013-04-18 21:18
bytes_writer-2.patch vstinner, 2013-04-18 22:38 review
bytes_writer-3.patch vstinner, 2013-04-21 02:03 review
Messages (18)
msg187035 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-15 22:03
In Python 3.3, I added _PyUnicodeWriter API to factorize code handling a Unicode "buffer", just the code to allocate memory and resize the buffer if needed.

I propose to do the same with a new _PyBytesWriter API. The API is very similar to _PyUnicodeWriter:

 * _PyBytesWriter_Init(writer)
 * _PyBytesWriter_Prepare(writer, count)
 * _PyBytesWriter_WriteStr(writer, bytes_obj)
 * _PyBytesWriter_WriteChar(writer, ch)
 * _PyBytesWriter_Finish(writer)
 * _PyBytesWriter_Dealloc(writer)

The patch changes ASCII, Latin1, UTF-8 and charmap encoders to use _PyBytesWriter API. A second patch changes CJK encoders.

I did not run a benchmark yet. I wrote a patch to factorize the code, not the make the code faster.

Notes on performances:

 * I peek the "small buffer allocated on the stack" idea from UTF-8 encoder, but the smaller buffer is always 500 bytes (instead of a size depending on the Unicode maximum character of the input Unicode string)
 * _PyBytesWriter overallocates by 25% (when overallocation is enabled), whereas charmap encoders doubles the buffer:

    /* exponentially overallocate to minimize reallocations */
    if (requiredsize < 2*outsize)
        requiredsize = 2*outsize;

 * I didn't check if the allocation size is the same with the patch. min_size and overallocate attributes should be set correctly to not make the code slower.
 * The code writing a single into a _PyUnicodeWriter buffer is inlined in unicodeobject.c. _PyBytesWriter API does not provide inlined function for the same purpose.
msg187036 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-15 22:10
See also #17694.
msg187053 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-04-16 07:51
Could you please provide the benchmarks results? I am afraid that it may hit a performance. Results on Windows are especially interesting.
msg187057 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-04-16 09:02
> I did not run a benchmark yet. I wrote a patch to factorize the code, 
> not the make the code faster.

Your patches don't seem to reduce the line count, so I don't understand the point.
msg187217 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-18 01:08
Here is a benchmark. It looks like the overallocation is not configured correctly for charmap and UTF-8 encoders: it is always enabled for UTF-8 encoder (whereas it should only be enabled on the first call to an error handler), and never enabled for charmap encoder.
msg187263 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-04-18 16:05
I run my own benchmarks and don't see any regression besides a random noise. Actually I see even speed up for some cases:

./python -m timeit -s "a = '\x80'*10000"  "a.encode()"

Before patch: 29.8 usec per loop, after patch: 21.5 usec per loop.
This is just a compiler's caprice.
msg187274 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-04-18 17:49
I have added some comments on Rietveld.
msg187297 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-18 21:18
Benchmark script, should be used with:
https://bitbucket.org/haypo/misc/src/tip/python/benchmark.py
msg187304 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-18 22:38
New version of the patch:
 - address most (all?) Serhiy's remarks
 - _PyBytesWriter_PrepareInternal() always use min_size, not only when overallocate is 1
 - add more comments

Performances are almost the same than without the patch. It looks like they are a little bit better.
msg187384 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-04-19 20:18
_PyBytesWriter and _PyUnicodeWriter have differen use cases. While _PyUnicodeWriter used primary in formatter where resulting size is rarely known and reallocation in decoders usually caused by widening result, _PyBytesWriter is used only in decoders where we usually can estimate a result size or it's upper bound. Resizing happened only in exceptional cases, when error handler called.

The patch contains a special case for writing only one bytes object. This is very unlikely case. It happened only when an encoded string contains only one illegal character. I think this case is not worth a complication and
obfuscation of the code. I think we should drop readonly attribute and a piece of the code (which looks buggy for me anyway).
msg187476 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-20 23:09
> The patch contains a special case for writing only one bytes object.
> This is very unlikely case.

The patch only modify a few functions to make them use the new _PyBytesWriter API. Other functions can use it.

A few examples:

 - PyBytes_FromObject()
 - binascii: binascii_rledecode_hqx()
 - bz2, lzma and zlib modules
 - marshal and pickle modules
 - datetime.datetime.strftime()
 - Python/compile.c: assemble_lnotab()
 - more Unicode decoders

But I agree that the readonly "hack" can be removed from _PyBytesWriter API since the bytes type has no format method (no bytes%args nor bytes.format(args)).
msg187478 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-04-21 00:11
It may eventually get one, though.
msg187480 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-21 00:59
> It may eventually get one, though.

If a use case for the "read-only hack" comes, the hack can be added again later. It's better to start with something simple and extend it with new use cases.
msg187484 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-21 02:03
I'm not completly satisfied of bytes_writer-2.patch. Most encoders work directly on a pointer (char*). If I want to keep the code using pointers (because it is efficient), I have to resynchronize the writer and the pointer before and after calling writer methods.

Here is a new patch implementing a different approach, closer to the current code. The patch version 3 has no "pos" attribute: the "position" is now a pointer (str). So encoders can just use this pointer instead of their own pointer.

I expect that replacing "*p++ = c;" with "*writer.str++ = c;" would not add an important overhead, especially because writer is a local variable, and str is the first attribute of the structure. I hope that the machine code will be exactly the same.
msg187522 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-04-21 20:28
The last patch increases the size of the code substantially. I'm still wondering what the benefits are.

$ hg di --stat
 Include/bytesobject.h      |   90 ++++++++++++++
 Misc/NEWS                  |    3 +
 Objects/bytesobject.c      |  144 ++++++++++++++++++++++
 Objects/stringlib/codecs.h |  109 ++++------------
 Objects/unicodeobject.c    |  267 ++++++++++++++++-------------------------
 5 files changed, 368 insertions(+), 245 deletions(-)
msg187595 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-22 22:00
"I expect that replacing "*p++ = c;" with "*writer.str++ = c;" would
not add an important overhead, especially because writer is a local
variable, and str is the first attribute of the structure. I hope that
the machine code will be exactly the same."

I ran some benchmarks: performances are better in some cases, but
worse in other cases. Performances are worse for simple loop like:

                while (collstart++<collend)
                    *writer.str++ = '?';

For the "writer.str++" instruction, the new pointer value is written
immediatly in the structure. The pointer value is also read again at
each iteration. So we have 1 load and 1 store per iteration. I read
the assembler code on x86_64 using GCC 4.2.1 of Mac OS 10.6.

I compared with the original code:

                while (collstart++<collend)
                    *str++ = '?';

GCC emits better code: str is stored in a register and the new value
of str is only written once, at the end of loop (instead of writing it
at each iteration). The pointer value is read before the loop. So we
have 0 load and 0 store (related to the pointer value) in the body of
the loop.

It may be an aliasing issue, but I didn't find how to say to GCC that
the new value of writer.str can be written only once at the end of the
loop. I tried to add __restrict__ keyword: the load (get the pointer
value) is moved out of the loop. But the store is still in the body of
the loop.
msg187944 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-27 22:28
Advantages of the patch.

* finer control on how the buffer is allocated: only overallocate if the replacement string (while handling an encoding error) is longer than 1 byte/character. The "replace" error handler should never use overallocation for example. Overallocation (when misused, when it was not needed) has a cost at the end of the encoder, because the buffer must be resized (shrink)

* use a buffer allocated on the stack for short strings. I'm not really convinced of this optimization. The data is still copied when the result is converted to a bytes objects (PyBytes_FromStringAndSize). It may be interesting if the encoder has to handle one or more errors: no need to resize the buffer until we reach the size of the small buffer (ex: 512 bytes).

* handle correctly integer overflow: most encoders do not catch integer overflow errors and may fail to handle (very) long strings (ex: encoded string longer than PY_SSIZE_T_MAX).

I'm not convinced that the patch would permit to design faster code. According to the assembler, it is the opposite (when "*writer.str++" is used in a loop). I don't know if it's possible to design a more efficient _PyBytesWriter API (to help GCC to generate more efficient machine code), nor if the overhead is important in a "normal case" (bench_encoders.py tests border cases, text with many many errors).
msg189415 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-05-16 21:26
_PyBytesWriter API makes the code slower and does not really reduce the number of lines, so I'm closing this issue as invalid.
History
Date User Action Args
2013-05-16 21:26:58vstinnersetstatus: open -> closed
resolution: not a bug
messages: + msg189415
2013-04-27 22:29:18vstinnersetmessages: - msg187943
2013-04-27 22:28:56vstinnersetmessages: + msg187944
2013-04-27 22:21:08vstinnersetmessages: + msg187943
2013-04-22 22:00:47vstinnersetmessages: + msg187595
2013-04-21 20:28:59pitrousetmessages: + msg187522
2013-04-21 02:03:28vstinnersetfiles: + bytes_writer-3.patch

messages: + msg187484
2013-04-21 00:59:11vstinnersetmessages: + msg187480
2013-04-21 00:11:05r.david.murraysetnosy: + r.david.murray
messages: + msg187478
2013-04-20 23:09:12vstinnersetmessages: + msg187476
2013-04-19 20:18:53serhiy.storchakasetmessages: + msg187384
2013-04-18 22:38:26vstinnersetfiles: + bytes_writer-2.patch

messages: + msg187304
2013-04-18 21:18:09vstinnersetfiles: + bench_encoders.py

messages: + msg187297
2013-04-18 17:49:39serhiy.storchakasetmessages: + msg187274
2013-04-18 16:05:07serhiy.storchakasetmessages: + msg187263
2013-04-18 01:08:07vstinnersetfiles: + bench_encoders.compare

messages: + msg187217
2013-04-16 09:02:44pitrousetmessages: + msg187057
2013-04-16 07:52:13serhiy.storchakasetnosy: + pitrou
2013-04-16 07:51:36serhiy.storchakasettype: enhancement
messages: + msg187053
components: + Interpreter Core
stage: patch review
2013-04-15 22:10:19vstinnersetmessages: + msg187036
2013-04-15 22:03:50vstinnersetfiles: + bytes_writer_cjkcodecs.patch
2013-04-15 22:03:35vstinnercreate