Message 187944 - Python tracker


Author	vstinner
Recipients	pitrou, r.david.murray, serhiy.storchaka, vstinner
Date	2013-04-27.22:28:56
Message-id	<1367101736.74.0.746220323103.issue17742@psf.upfronthosting.co.za>
In-reply-to

Content

Advantages of the patch:

* finer control over how the buffer is allocated: overallocate only if the replacement string (produced while handling an encoding error) is longer than 1 byte/character. For example, the "replace" error handler should never use overallocation. Overallocation has a cost at the end of the encoder when it is misused (when it was not needed), because the buffer must then be resized (shrunk).

* use a buffer allocated on the stack for short strings. I'm not really convinced by this optimization: the data is still copied when the result is converted to a bytes object (PyBytes_FromStringAndSize). It may be worthwhile if the encoder has to handle one or more errors: there is no need to resize the buffer until we reach the size of the small buffer (e.g. 512 bytes).

* handle integer overflow correctly: most encoders do not catch integer overflow errors and may fail to handle (very) long strings (e.g. an encoded string longer than PY_SSIZE_T_MAX).
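
To make the three points above concrete, here is a minimal C sketch. It is not the _PyBytesWriter code from the patch: the names sketch_writer, sw_init, sw_reserve, sw_write and sw_free are hypothetical, SIZE_MAX / 2 stands in for PY_SSIZE_T_MAX, and the 25% growth factor is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SMALL_BUF_SIZE 512      /* the example size given in the message */

/* Hypothetical writer, NOT the _PyBytesWriter API from the patch. */
typedef struct {
    char small[SMALL_BUF_SIZE]; /* stack storage used for short outputs */
    char *buf;                  /* points at small[] or at a heap block */
    size_t used;                /* bytes written so far */
    size_t cap;                 /* bytes available in buf */
    int overallocate;           /* enable only when a replacement is > 1 byte */
} sketch_writer;

static void sw_init(sketch_writer *w)
{
    w->buf = w->small;
    w->used = 0;
    w->cap = SMALL_BUF_SIZE;
    w->overallocate = 0;
}

/* Make room for `extra` more bytes.
   Returns 0 on success, -1 on integer overflow or out of memory. */
static int sw_reserve(sketch_writer *w, size_t extra)
{
    const size_t limit = SIZE_MAX / 2;  /* assumption: stand-in for PY_SSIZE_T_MAX */

    if (extra > limit - w->used)
        return -1;                      /* used + extra would exceed the limit */
    size_t need = w->used + extra;
    if (need <= w->cap)
        return 0;                       /* still fits, e.g. in the stack buffer */

    size_t newcap = need;
    if (w->overallocate && need <= limit - need / 4)
        newcap = need + need / 4;       /* arbitrary 25% slack, shrunk later */

    char *p;
    if (w->buf == w->small) {
        /* First spill from the stack buffer to the heap. */
        p = malloc(newcap);
        if (p != NULL)
            memcpy(p, w->small, w->used);
    }
    else {
        p = realloc(w->buf, newcap);
    }
    if (p == NULL)
        return -1;
    w->buf = p;
    w->cap = newcap;
    return 0;
}

static int sw_write(sketch_writer *w, const char *data, size_t n)
{
    if (sw_reserve(w, n) < 0)
        return -1;
    memcpy(w->buf + w->used, data, n);
    w->used += n;
    assert(w->used <= w->cap);
    return 0;
}

static void sw_free(sketch_writer *w)
{
    if (w->buf != w->small)
        free(w->buf);
}
```

With overallocate left at 0, the buffer grows to exactly the requested size, so nothing has to be shrunk at the end. The final copy into a bytes object is not shown here; as noted above, it is needed whether or not the stack buffer was used.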

I'm not convinced that the patch makes it possible to write faster code. According to the generated assembly, it is the opposite (when "*writer.str++" is used in a loop). I don't know whether it's possible to design a more efficient _PyBytesWriter API (to help GCC generate more efficient machine code), nor whether the overhead matters in a "normal case" (bench_encoders.py tests corner cases: text with a great many errors).
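
For illustration, the two loop shapes being compared might look like the sketch below. The struct pos_writer and both functions are hypothetical, not taken from the patch. One possible explanation for the difference (an assumption, not something measured here) is that a store through a char pointer may alias the struct field itself, so in variant A the compiler may have to reload and store w->str on every iteration, whereas variant B can keep the position in a register.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a writer that exposes its write position. */
typedef struct {
    char *str;  /* current write position */
} pos_writer;

/* Variant A: advance the struct field on every byte, like "*writer.str++". */
static void copy_via_field(pos_writer *w, const char *src, size_t n)
{
    assert(w->str != NULL);
    for (size_t i = 0; i < n; i++)
        *w->str++ = src[i];
}

/* Variant B: advance a local copy of the pointer and write it back once. */
static void copy_via_local(pos_writer *w, const char *src, size_t n)
{
    assert(w->str != NULL);
    char *p = w->str;
    for (size_t i = 0; i < n; i++)
        *p++ = src[i];
    w->str = p;
}
```

Both variants produce the same bytes and the same final position; whether they actually differ in speed depends on the compiler and flags, so the generated assembly would need to be checked for the real encoders.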
