classification
Title: Use _PyUnicodeWriter API in text decoders
Type: performance
Stage: resolved
Components:
Versions: Python 3.4
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: loewis, python-dev, serhiy.storchaka, vstinner
Priority: normal
Keywords: patch

Created on 2012-10-24 18:38 by vstinner, last changed 2012-11-07 22:53 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
codecs_writer.patch vstinner, 2012-10-24 18:38 review
codecs_writer_2.patch serhiy.storchaka, 2012-10-31 13:10 review
decodebench.res serhiy.storchaka, 2012-10-31 13:14 Benchmark results
Messages (9)
msg173695 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-10-24 18:38
The attached patch modifies the text decoders to use the _PyUnicodeWriter API, to factor out common code. It removes the unicode_widen() and unicode_putchar() functions.

 * Don't overallocate by default (except for the "raw-unicode-escape" codec); enable overallocation on the first decode error (as is done currently)
 * _PyUnicodeWriter_Prepare() only overallocates by 25%, instead of 100%, for unicode_decode_call_errorhandler()
 * Use _PyUnicodeWriter_Prepare() + PyUnicode_WRITE() (two macros) instead of unicode_putchar() (a function)
 * The _PyUnicodeWriter structure stores many useful fields, so we don't have to pass multiple parameters to functions, only the writer

I wrote the patch to factor out the code, but it may also be faster.
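The allocation strategy described above can be modeled in a few lines of Python. This is only a hypothetical sketch of the policy (the real _PyUnicodeWriter is implemented in C, and the class and method names below are invented for illustration): allocate exactly the expected size up front, and switch to 25% overallocation only once the first decode error occurs.

```python
class WriterSketch:
    """Toy model of the _PyUnicodeWriter growth policy (hypothetical names)."""

    def __init__(self, expected):
        # Allocate exactly the expected size: no overallocation by default.
        self.buf = [None] * expected
        self.pos = 0
        self.overallocate = False  # flipped on the first decode error

    def prepare(self, extra):
        """Ensure room for `extra` more characters (models _PyUnicodeWriter_Prepare)."""
        needed = self.pos + extra
        if needed <= len(self.buf):
            return
        if self.overallocate:
            # Grow by 25% instead of doubling (100%).
            needed += needed // 4
        self.buf.extend([None] * (needed - len(self.buf)))

    def write_char(self, ch):
        self.prepare(1)
        self.buf[self.pos] = ch
        self.pos += 1

w = WriterSketch(expected=4)
for ch in "abcd":
    w.write_char(ch)
assert len(w.buf) == 4      # exact fit, nothing wasted on the fast path
w.overallocate = True       # a decode error occurred
w.write_char("?")           # write a replacement character
assert len(w.buf) == 6      # 5 needed, plus 25% overallocation
```

The point of the policy is that valid input (the common case) never pays for a buffer it does not use, while error-heavy input amortizes its repeated growth.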
msg173697 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-24 19:44
Soon I'll post a patch which speeds up the unicode-escape and raw-unicode-escape decoders by 1.5-3x. There are also not-yet-reviewed patches for the UTF-32 (issue14625) and charmap (issue14850) decoders, so there will be merge conflicts.

But I will review the patch.
msg174171 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-10-30 01:02
"Soon I'll post a patch, which speeds up unicode-escape and raw-unicode-escape decoders to 1.5-3x. Also there are not yet reviewed patches for UTF-32 (issue14625) and charmap (issue14850) decoders. Will be merge conflicts."

codecs_writer.patch doesn't change the core of the decoders much; it mostly changes the code before and after the loop, and the error handling. You can still use PyUnicode_WRITE, PyUnicode_READ, memcpy(), etc.

"But I will review the patch."

If you review the patch, please check how the buffer is allocated. It should not be overallocated by default, only on the first error. Overallocation can kill performance when it is not necessary (especially on Windows).
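The error path in question is the one taken when an error handler runs; it is easy to see from Python with the standard codec machinery. Valid input stays on the fast path, while the first invalid byte hands control to the error handler, which is the point where the decoder starts overallocating its internal buffer:

```python
# Valid data takes the fast path: the result has exactly the expected size.
assert b"abc".decode("utf-8") == "abc"

# An invalid byte triggers the error handler ('replace' writes U+FFFD);
# only from this point on does the decoder enable overallocation.
assert b"ab\xff".decode("utf-8", "replace") == "ab\ufffd"

# 'ignore' also goes through the error-handling path.
assert b"ab\xff".decode("utf-8", "ignore") == "ab"
```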
msg174238 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-30 23:17
I will do some experiments and review tomorrow.
msg174273 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-31 12:50
I updated the patch to resolve the conflict with issue14625.
msg174275 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-31 13:58
With the patch, the UTF-8 decoder is 20% slower for some data. The UTF-16 decoder is 20% faster for some data and 20% slower for other data. The UTF-32 decoder is slower for many inputs (even after some optimization; the naive code was up to 50% slower). The standard charmap decoder is 10% slower. Only UTF-7, unicode-escape and raw-unicode-escape have become much faster (unicode-escape and raw-unicode-escape as much as with the issue16334 patch).

Well-optimized decoders do not benefit from _PyUnicodeWriter; they only get a slight slowdown. The patch requires some optimization (as for the UTF-32 decoder) to reduce the negative effect. Non-optimized decoders receive a great benefit.
msg174293 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-10-31 15:30
I ran the decodebench.py and bench-diff.py scripts from #14624; I just replaced repeat=10 with repeat=100 to get more reliable numbers. I only see some performance regressions between -5% and -1%, but there are some speedups on UTF-8 and UTF-32 (between +11% and +14%). On a microbenchmark, numbers in the -10%..+10% range just mean "no change".

Using _PyUnicodeWriter should not change performance at all on valid data; it only affects the performance of handling decoding errors, where the overallocation factor, the code to widen the buffer, and the code to write replacement characters differ.
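The scripts themselves are attached to #14624 and are not reproduced here; a minimal micro-benchmark in the same spirit (a sketch using the stdlib timeit module rather than the actual decodebench.py harness, with the function name bench() invented for illustration) could look like:

```python
import timeit

def bench(codec, text, number=1000, repeat=5):
    """Time decoding `text` encoded with `codec`; return the best run in seconds."""
    data = text.encode(codec)
    return min(timeit.repeat(lambda: data.decode(codec), number=number, repeat=repeat))

for codec in ("ascii", "latin1", "utf-8", "utf-16-le", "utf-32-le"):
    best = bench(codec, "A" * 10000)
    print(f"{codec:10} {best:.6f}s")
```

Taking the minimum over several repeats, as timeit.repeat() allows, is the usual way to reduce scheduler noise; even so, single-digit percentage differences on such a loop are within measurement error, which is the point made above.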
msg175034 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-11-06 23:41
New changeset 7ed9993d53b4 by Victor Stinner in branch 'default':
Close #16311: Use the _PyUnicodeWriter API in text decoders
http://hg.python.org/cpython/rev/7ed9993d53b4
msg175129 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-07 22:53
Oh, I forgot my benchmark results.

decodebench.py results on Linux 32 bits:
(Linux-3.2.0-32-generic-pae-i686-with-debian-wheezy-sid)

$ ./python bench-diff.py original writer
ascii     'A'*10000                       4109 (-3%)    3974

latin1    'A'*10000                       3851 (-5%)    3644
latin1    '\x80'*10000                    14832 (-3%)   14430

utf-8     'A'*10000                       3747 (-4%)    3608
utf-8     '\x80'*10000                    976 (-2%)     961
utf-8     '\u0100'*10000                  974 (-2%)     959
utf-8     '\u8000'*10000                  804 (-14%)    694
utf-8     '\U00010000'*10000              666 (-5%)     635

utf-16le  'A'*10000                       4154 (-1%)    4117
utf-16le  '\x80'*10000                    4055 (-2%)    3988
utf-16le  '\u0100'*10000                  4047 (-2%)    3974
utf-16le  '\u8000'*10000                  917 (-1%)     912
utf-16le  '\U00010000'*10000              872 (-0%)     870

utf-16be  'A'*10000                       3218 (-1%)    3185
utf-16be  '\x80'*10000                    3163 (-2%)    3114
utf-16be  '\u0100'*10000                  2591 (-1%)    2556
utf-16be  '\u8000'*10000                  979 (-1%)     974
utf-16be  '\U00010000'*10000              928 (-0%)     925

utf-32le  'A'*10000                       1681 (+12%)   1885
utf-32le  '\x80'*10000                    1697 (+10%)   1865
utf-32le  '\u0100'*10000                  2224 (+1%)    2254
utf-32le  '\u8000'*10000                  2224 (+2%)    2269
utf-32le  '\U00010000'*10000              2234 (+1%)    2260

utf-32be  'A'*10000                       1685 (+11%)   1868
utf-32be  '\x80'*10000                    1684 (+10%)   1860
utf-32be  '\u0100'*10000                  2223 (+1%)    2253
utf-32be  '\u8000'*10000                  2222 (+1%)    2255
utf-32be  '\U00010000'*10000              2243 (+1%)    2257

decodebench.py results on Linux 64 bits:
(Linux-3.4.9-2.fc16.x86_64-x86_64-with-fedora-16-Verne)

ascii     'A'*10000                       10043 (+1%)   10144

latin1    'A'*10000                       8351 (-1%)    8258
latin1    '\x80'*10000                    19184 (+2%)   19560

utf-8     'A'*10000                       8083 (+5%)    8461
utf-8     '\x80'*10000                    982 (+1%)     993
utf-8     '\u0100'*10000                  984 (+1%)     992
utf-8     '\u8000'*10000                  806 (+31%)    1053
utf-8     '\U00010000'*10000              639 (+12%)    718

utf-16le  'A'*10000                       5547 (-2%)    5422
utf-16le  '\x80'*10000                    5205 (+1%)    5271
utf-16le  '\u0100'*10000                  4900 (-4%)    4695
utf-16le  '\u8000'*10000                  1062 (+9%)    1154
utf-16le  '\U00010000'*10000              1040 (+4%)    1078

utf-16be  'A'*10000                       5416 (-5%)    5157
utf-16be  '\x80'*10000                    5077 (-1%)    5011
utf-16be  '\u0100'*10000                  4261 (-1%)    4218
utf-16be  '\u8000'*10000                  1146 (+0%)    1147
utf-16be  '\U00010000'*10000              1125 (-1%)    1119

utf-32le  'A'*10000                       1743 (+8%)    1880
utf-32le  '\x80'*10000                    1751 (+5%)    1842
utf-32le  '\u0100'*10000                  2114 (+29%)   2721
utf-32le  '\u8000'*10000                  2120 (+28%)   2718
utf-32le  '\U00010000'*10000              2065 (+30%)   2690

utf-32be  'A'*10000                       1761 (+6%)    1860
utf-32be  '\x80'*10000                    1749 (+6%)    1856
utf-32be  '\u0100'*10000                  2101 (+29%)   2715
utf-32be  '\u8000'*10000                  2083 (+30%)   2715
utf-32be  '\U00010000'*10000              2058 (+31%)   2689

Most significant changes:
 * -14% to decode '\u8000'*10000 from UTF-8 on Linux 32 bits
 * +31% to decode '\u8000'*10000 from UTF-8 on Linux 64 bits
 * +28% to +31% to decode UCS-2 and UCS-4 characters from UTF-32 on Linux 64 bits

@Serhiy Storchaka: If you feel able to tune _PyUnicodeWriter to
improve its performance, please open a new issue.

I consider the performance changes acceptable and I don't plan to work
on this topic.
History
Date User Action Args
2012-11-07 22:53:54vstinnersetmessages: + msg175129
2012-11-06 23:41:02python-devsetstatus: open -> closed

nosy: + python-dev
messages: + msg175034

resolution: fixed
stage: resolved
2012-10-31 15:30:41vstinnersetmessages: + msg174293
2012-10-31 13:58:45serhiy.storchakasetmessages: + msg174275
2012-10-31 13:14:31serhiy.storchakasetfiles: + decodebench.res
2012-10-31 13:10:30serhiy.storchakasetfiles: - codecs_writer_2.patch
2012-10-31 13:10:19serhiy.storchakasetfiles: + codecs_writer_2.patch
2012-10-31 12:50:05serhiy.storchakasetfiles: + codecs_writer_2.patch

messages: + msg174273
2012-10-30 23:17:20serhiy.storchakasetmessages: + msg174238
2012-10-30 01:02:32vstinnersetmessages: + msg174171
2012-10-24 19:44:32serhiy.storchakasetmessages: + msg173697
2012-10-24 18:38:53vstinnersetnosy: + loewis, serhiy.storchaka
2012-10-24 18:38:21vstinnercreate