Issue 22649: Use _PyUnicodeWriter in case_operation()

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/66839

classification

Title:	Use _PyUnicodeWriter in case_operation()
Type:	performance	Stage:
Components:		Versions:	Python 3.5

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:		Nosy List:	Arfrever, pitrou, serhiy.storchaka, vstinner
Priority:	normal	Keywords:	patch

Created on 2014-10-15 20:30 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
case_writer.patch	vstinner, 2014-10-15 20:30		review
bench_case.py	vstinner, 2014-10-15 20:49
bench.txt	vstinner, 2014-10-15 20:50

Messages (5)
msg229497 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-10-15 20:30
The case_operation() in Objects/unicodeobject.c is used for case operations: lower, upper, casefold, etc. Currently, the function uses a buffer of Py_UCS4 and overallocate the buffer by 300%. The function uses the worst case: one character replaced with 3 characters. I propose the use the _PyUnicodeWriter API to be able to optimize the most common case: each character is replaced by only one another character, and the output string uses the same unicode kind (UCS1, UCS2 or UCS4). The patch preallocates the writer using the kind of the input string, but in some cases, the result uses a lower kind (ex: latin1 => ASCII). "Special" characters taking the slow path from unit tests: - test_capitalize: 'ﬁnnish' => 'FInnish' (ascii) - test_casefold: 'ß' => 'ss', 'ﬁ' => 'fi' - test_swapcase: 'ﬁ' => 'FI', 'ß' => 'SS' - test_title: 'ﬁNNISH' => 'Finnish' - test_upper: 'ﬁ' => 'FI', 'ß' => 'SS' The writer only uses overallocation if a replaced character uses more than one character. Bad cases where the length changes: - test_capitalize: 'ῳῳῼῼ' => 'ΩΙῳῳῳ', 'hİ' => 'Hi̇', 'ῒİ' => 'Ϊ̀i̇', 'ﬁnnish' => 'FInnish' - test_casefold: 'ß' => 'ss', 'ﬁ' => 'fi' - test_lower: 'İ' => 'i̇' - test_swapcase: 'ﬁ' => 'FI', 'İ' => 'i̇', 'ß' => 'SS', 'ῒ' => 'Ϊ̀' - test_title: 'ﬁNNISH' => 'Finnish' - test_upper: 'ﬁ' => 'FI', 'ß' => 'SS', 'ῒ', 'Ϊ̀'
msg229499 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-10-15 20:48
Add tests for 'µ' or 'ÿ' (upper maps UCS1 to UCS2), 'ΐ' or like (upper maps UCS2 to 3 UCS2), 'ﬃ' or 'ﬄ' (upper maps UCS2 to 3 ASCII), 'İ' (only one character for which lower doesn't map to 1 character), 'Å' (lower maps UCS2 to UCS1), any of Deseret or Warang Citi characters (UCS4).
msg229501 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-10-15 20:49
Benchmark: bench_case.py. Hum, case_writer.patch looks to be always slower: --------------------+--------------+---------------- Summary \| orig \| writer --------------------+--------------+---------------- lower with 'a' \| 5.76 ms () \| 5.76 ms lower with 'é' \| 62.9 ms () \| 76.8 ms (+22%) lower with '€' \| 75.2 ms () \| 83.6 ms (+11%) lower with 'ﬁ' \| 75.3 ms () \| 83.7 ms (+11%) lower with 'ß' \| 66.4 ms () \| 76 ms (+15%) upper with 'a' \| 5.66 ms () \| 5.66 ms upper with 'é' \| 48.3 ms () \| 75.9 ms (+57%) upper with '€' \| 50.1 ms () \| 77.9 ms (+55%) upper with 'ﬁ' \| 93.7 ms () \| 137 ms (+46%) upper with 'ß' \| 91.9 ms () \| 119 ms (+29%) casefold with 'a' \| 5.66 ms () \| 5.67 ms casefold with 'é' \| 64.5 ms () \| 95.8 ms (+48%) casefold with '€' \| 67 ms () \| 96.1 ms (+43%) casefold with 'ﬁ' \| 97.1 ms () \| 132 ms (+35%) casefold with 'ß' \| 93.7 ms () \| 122 ms (+30%) swapcase with 'a' \| 99.7 ms () \| 107 ms (+7%) swapcase with 'é' \| 99.7 ms () \| 107 ms (+7%) swapcase with '€' \| 78 ms () \| 87.4 ms (+12%) swapcase with 'ﬁ' \| 143 ms () \| 152 ms (+7%) swapcase with 'ß' \| 140 ms () \| 138 ms title with 'a' \| 82 ms () \| 98.2 ms (+20%) title with 'é' \| 81.9 ms () \| 98.2 ms (+20%) title with '€' \| 90.2 ms () \| 115 ms (+28%) title with 'ﬁ' \| 93.9 ms () \| 112 ms (+20%) title with 'ß' \| 91.3 ms () \| 103 ms (+13%) capitalize with 'a' \| 62.3 ms () \| 79.2 ms (+27%) capitalize with 'é' \| 62.1 ms () \| 79.1 ms (+27%) capitalize with '€' \| 72.9 ms () \| 76.5 ms capitalize with 'ﬁ' \| 72.6 ms () \| 90.3 ms (+24%) capitalize with 'ß' \| 69.5 ms () \| 80 ms (+15%) --------------------+--------------+---------------- Total \| 2.24 sec (*) \| 2.71 sec (+21%) --------------------+--------------+---------------- See bench.txt for the full output.
msg229534 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-10-16 15:55
Looks like it's cheaper to overallocate than add checks for overflow at each loop iteration.
msg229536 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-10-16 16:20
> Looks like it's cheaper to overallocate than add checks for overflow at each loop iteration. I expected that the temporary Py_UCS4 buffer and the conversion to a Unicode object (Py_UCS1, Py_UCS2 or Py_UCS4) would be more expensive than _PyUnicodeWriter. It looks like it's slower. I tried to optimize the code but I didn't see how to make it really faster than the current code. -- Currently, the code uses: for (j = 0; j < n_res; j++) { maxchar = Py_MAX(maxchar, mapped[j]); res[k++] = mapped[j]; } where res is a Py_UCS4* string, and mapped an array of 3 Py_UCS4. I replaced it with a call to case_operation_write() which calls _PyUnicodeWriter_WriteCharInline(). _PyUnicodeWriter_WriteCharInline() is maybe more expensive than "res[k++] = mapped[j];".

History
Date	User	Action	Args
2022-04-11 14:58:09	admin	set	github: 66839
2015-04-06 21:43:02	vstinner	set	status: open -> closed resolution: rejected
2014-10-16 16:20:07	vstinner	set	messages: + msg229536
2014-10-16 15:55:24	pitrou	set	nosy: + pitrou messages: + msg229534
2014-10-16 12:21:35	Arfrever	set	nosy: + Arfrever
2014-10-15 20:50:15	vstinner	set	files: + bench.txt
2014-10-15 20:49:58	vstinner	set	files: + bench_case.py messages: + msg229501
2014-10-15 20:48:15	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg229499
2014-10-15 20:30:45	vstinner	create