Message 187035 - Python tracker


Author	vstinner
Recipients	serhiy.storchaka, vstinner
Date	2013-04-15.22:03:32
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1366063415.31.0.505871872027.issue17742@psf.upfronthosting.co.za>
In-reply-to

Content
In Python 3.3, I added the _PyUnicodeWriter API to factor out the code that handles a Unicode "buffer": the code that allocates memory and resizes the buffer when needed.

I propose to do the same with a new _PyBytesWriter API. The API is very similar to _PyUnicodeWriter:

* _PyBytesWriter_Init(writer)
* _PyBytesWriter_Prepare(writer, count)
* _PyBytesWriter_WriteStr(writer, bytes_obj)
* _PyBytesWriter_WriteChar(writer, ch)
* _PyBytesWriter_Finish(writer)
* _PyBytesWriter_Dealloc(writer)
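
To make the init / prepare / write / finish / dealloc life cycle concrete, here is a minimal standalone sketch. It is not the code from the patch: the toy_writer type, its fields and all toy_writer_* functions are hypothetical names invented for this illustration, and the result is a plain malloc() block instead of a bytes object. It only assumes what this message states: a small 500-byte buffer that lives on the stack, and a 25% overallocation when the buffer has to grow.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy writer; NOT the _PyBytesWriter struct from the patch. */
typedef struct {
    char small[500];   /* small buffer embedded in the struct (on the stack) */
    char *buf;         /* points at small[] or at a heap block */
    size_t size;       /* number of bytes written so far */
    size_t allocated;  /* capacity of buf in bytes */
    int overallocate;  /* if nonzero, add 25% when growing */
} toy_writer;

void toy_writer_init(toy_writer *w)
{
    w->buf = w->small;
    w->size = 0;
    w->allocated = sizeof(w->small);
    w->overallocate = 1;
}

/* Make room for `count` more bytes. Returns 0 on success, -1 on failure.
   Integer overflow checks are omitted to keep the sketch short. */
int toy_writer_prepare(toy_writer *w, size_t count)
{
    char *p;
    size_t needed = w->size + count;

    if (needed <= w->allocated)
        return 0;
    if (w->overallocate)
        needed += needed / 4;
    if (w->buf == w->small) {
        /* first growth: move from the stack buffer to the heap */
        p = malloc(needed);
        if (p != NULL)
            memcpy(p, w->small, w->size);
    }
    else {
        p = realloc(w->buf, needed);
    }
    if (p == NULL)
        return -1;
    w->buf = p;
    w->allocated = needed;
    return 0;
}

/* Append `len` bytes. Returns 0 on success, -1 on failure. */
int toy_writer_write(toy_writer *w, const char *data, size_t len)
{
    if (toy_writer_prepare(w, len) < 0)
        return -1;
    memcpy(w->buf + w->size, data, len);
    w->size += len;
    return 0;
}

/* Release the heap block, if any, and reset the writer. */
void toy_writer_dealloc(toy_writer *w)
{
    if (w->buf != w->small)
        free(w->buf);
    toy_writer_init(w);
}

/* Copy the written bytes into an exactly sized heap block owned by the
   caller, then reset the writer. Returns NULL on allocation failure. */
char *toy_writer_finish(toy_writer *w, size_t *out_len)
{
    char *result = malloc(w->size ? w->size : 1);
    if (result != NULL) {
        memcpy(result, w->buf, w->size);
        *out_len = w->size;
    }
    toy_writer_dealloc(w);
    return result;
}
```

An encoder written against such a writer would call the init function once, the prepare or write function inside its loop, the finish function on success and the dealloc function on every error path. Short outputs never touch the heap until the final copy.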

The patch changes the ASCII, Latin1, UTF-8 and charmap encoders to use the _PyBytesWriter API. A second patch changes the CJK encoders.

I have not run a benchmark yet. I wrote the patch to factor out the code, not to make the code faster.

Notes on performance:

* I took the "small buffer allocated on the stack" idea from the UTF-8 encoder, but the small buffer is always 500 bytes (instead of a size that depends on the maximum character of the input Unicode string)
* _PyBytesWriter overallocates by 25% (when overallocation is enabled), whereas the charmap encoders double the buffer:

    /* exponentially overallocate to minimize reallocations */
    if (requiredsize < 2*outsize)
        requiredsize = 2*outsize;

* I didn't check whether the allocation size is the same with the patch. The min_size and overallocate attributes must be set correctly so that the code does not become slower.
* The code writing a single character into a _PyUnicodeWriter buffer is inlined in unicodeobject.c. The _PyBytesWriter API does not provide an inlined function for the same purpose.
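
The difference between the two growth policies mentioned above can be estimated with a small model. The sketch below is a simplification, not a measurement of the patch: it assumes bytes are appended one at a time, that the buffer starts at a given capacity, and that each growth applies either "needed + 25%" or the charmap rule quoted above. The names grow_quarter, grow_double and count_reallocs are made up for this example.

```c
#include <stddef.h>

/* A growth policy maps (bytes needed, current capacity) to a new capacity. */
typedef size_t (*grow_fn)(size_t needed, size_t capacity);

/* 25% overallocation on top of the required size. */
size_t grow_quarter(size_t needed, size_t capacity)
{
    (void)capacity;  /* unused by this policy */
    return needed + needed / 4;
}

/* Charmap-style rule: at least double the current capacity. */
size_t grow_double(size_t needed, size_t capacity)
{
    return needed < 2 * capacity ? 2 * capacity : needed;
}

/* Count how many times the buffer grows while `target` bytes are appended
   one by one to a buffer whose initial capacity is `start`. */
size_t count_reallocs(grow_fn grow, size_t start, size_t target)
{
    size_t capacity = start;
    size_t reallocs = 0;
    size_t size;

    for (size = 0; size < target; size++) {
        if (size + 1 > capacity) {
            capacity = grow(size + 1, capacity);
            reallocs++;
        }
    }
    return reallocs;
}
```

Under this model, growing from 500 bytes to 100000 bytes takes 8 reallocations with doubling, and noticeably more with the 25% policy. Whether that matters in practice depends on how well min_size predicts the final size, which is why a benchmark is still needed.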

History
Date	User	Action	Args
2013-04-15 22:03:35	vstinner	set	recipients: + vstinner, serhiy.storchaka
2013-04-15 22:03:35	vstinner	set	messageid: <1366063415.31.0.505871872027.issue17742@psf.upfronthosting.co.za>
2013-04-15 22:03:35	vstinner	link	issue17742 messages
2013-04-15 22:03:34	vstinner	create