This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Use _PyUnicodeWriter in case_operation()
Type: performance Stage:
Components: Versions: Python 3.5
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, pitrou, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2014-10-15 20:30 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
case_writer.patch vstinner, 2014-10-15 20:30 review
bench_case.py vstinner, 2014-10-15 20:49
bench.txt vstinner, 2014-10-15 20:50
Messages (5)
msg229497 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-10-15 20:30
The case_operation() in Objects/unicodeobject.c is used for case operations: lower, upper, casefold, etc.

Currently, the function uses a buffer of Py_UCS4 and overallocate the buffer by 300%. The function uses the worst case: one character replaced with 3 characters.

I propose the use the _PyUnicodeWriter API to be able to optimize the most common case: each character is replaced by only one another character, and the output string uses the same unicode kind (UCS1, UCS2 or UCS4).

The patch preallocates the writer using the kind of the input string, but in some cases, the result uses a lower kind (ex: latin1 => ASCII). "Special" characters taking the slow path from unit tests:

- test_capitalize: 'finnish' => 'FInnish' (ascii)
- test_casefold: 'ß' => 'ss', 'fi' => 'fi'
- test_swapcase: 'fi' => 'FI', 'ß' => 'SS'
- test_title: 'fiNNISH' => 'Finnish'
- test_upper: 'fi' => 'FI', 'ß' => 'SS'

The writer only uses overallocation if a replaced character uses more than one character. Bad cases where the length changes:

- test_capitalize: 'ῳῳῼῼ' => 'ΩΙῳῳῳ', 'hİ' => 'Hi̇', 'ῒİ' => 'Ϊ̀i̇', 'finnish' => 'FInnish'
- test_casefold: 'ß' => 'ss', 'fi' => 'fi'
- test_lower: 'İ' => 'i̇'
- test_swapcase: 'fi' => 'FI', 'İ' => 'i̇', 'ß' => 'SS', 'ῒ' => 'Ϊ̀'
- test_title: 'fiNNISH' => 'Finnish'
- test_upper: 'fi' => 'FI', 'ß' => 'SS', 'ῒ', 'Ϊ̀'
msg229499 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-10-15 20:48
Add tests for 'µ' or 'ÿ' (upper maps UCS1 to UCS2), 'ΐ' or like (upper maps UCS2 to 3 UCS2), 'ffi' or 'ffl' (upper maps UCS2 to 3 ASCII), 'İ' (only one character for which lower doesn't map to 1 character), 'Å' (lower maps UCS2 to UCS1), any of Deseret or Warang Citi characters (UCS4).
msg229501 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-10-15 20:49
Benchmark: bench_case.py. Hum, case_writer.patch looks to be always slower:

--------------------+--------------+----------------
Summary             |         orig |          writer
--------------------+--------------+----------------
lower with 'a'      |  5.76 ms (*) |         5.76 ms
lower with 'é'      |  62.9 ms (*) |  76.8 ms (+22%)
lower with '€'      |  75.2 ms (*) |  83.6 ms (+11%)
lower with 'fi'      |  75.3 ms (*) |  83.7 ms (+11%)
lower with 'ß'      |  66.4 ms (*) |    76 ms (+15%)
upper with 'a'      |  5.66 ms (*) |         5.66 ms
upper with 'é'      |  48.3 ms (*) |  75.9 ms (+57%)
upper with '€'      |  50.1 ms (*) |  77.9 ms (+55%)
upper with 'fi'      |  93.7 ms (*) |   137 ms (+46%)
upper with 'ß'      |  91.9 ms (*) |   119 ms (+29%)
casefold with 'a'   |  5.66 ms (*) |         5.67 ms
casefold with 'é'   |  64.5 ms (*) |  95.8 ms (+48%)
casefold with '€'   |    67 ms (*) |  96.1 ms (+43%)
casefold with 'fi'   |  97.1 ms (*) |   132 ms (+35%)
casefold with 'ß'   |  93.7 ms (*) |   122 ms (+30%)
swapcase with 'a'   |  99.7 ms (*) |    107 ms (+7%)
swapcase with 'é'   |  99.7 ms (*) |    107 ms (+7%)
swapcase with '€'   |    78 ms (*) |  87.4 ms (+12%)
swapcase with 'fi'   |   143 ms (*) |    152 ms (+7%)
swapcase with 'ß'   |   140 ms (*) |          138 ms
title with 'a'      |    82 ms (*) |  98.2 ms (+20%)
title with 'é'      |  81.9 ms (*) |  98.2 ms (+20%)
title with '€'      |  90.2 ms (*) |   115 ms (+28%)
title with 'fi'      |  93.9 ms (*) |   112 ms (+20%)
title with 'ß'      |  91.3 ms (*) |   103 ms (+13%)
capitalize with 'a' |  62.3 ms (*) |  79.2 ms (+27%)
capitalize with 'é' |  62.1 ms (*) |  79.1 ms (+27%)
capitalize with '€' |  72.9 ms (*) |         76.5 ms
capitalize with 'fi' |  72.6 ms (*) |  90.3 ms (+24%)
capitalize with 'ß' |  69.5 ms (*) |    80 ms (+15%)
--------------------+--------------+----------------
Total               | 2.24 sec (*) | 2.71 sec (+21%)
--------------------+--------------+----------------

See bench.txt for the full output.
msg229534 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-10-16 15:55
Looks like it's cheaper to overallocate than add checks for overflow at each loop iteration.
msg229536 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-10-16 16:20
> Looks like it's cheaper to overallocate than add checks for overflow at each loop iteration.

I expected that the temporary Py_UCS4 buffer and the conversion to a Unicode object (Py_UCS1, Py_UCS2 or Py_UCS4) would be more expensive than _PyUnicodeWriter. It looks like it's slower.

I tried to optimize the code but I didn't see how to make it really faster than the current code.

--

Currently, the code uses:

for (j = 0; j < n_res; j++) {
   *maxchar = Py_MAX(*maxchar, mapped[j]);
   res[k++] = mapped[j];
}

where res is a Py_UCS4* string, and mapped an array of 3 Py_UCS4.

I replaced it with a call to case_operation_write() which calls _PyUnicodeWriter_WriteCharInline().

_PyUnicodeWriter_WriteCharInline() is maybe more expensive than "res[k++] = mapped[j];".
History
Date User Action Args
2022-04-11 14:58:09adminsetgithub: 66839
2015-04-06 21:43:02vstinnersetstatus: open -> closed
resolution: rejected
2014-10-16 16:20:07vstinnersetmessages: + msg229536
2014-10-16 15:55:24pitrousetnosy: + pitrou
messages: + msg229534
2014-10-16 12:21:35Arfreversetnosy: + Arfrever
2014-10-15 20:50:15vstinnersetfiles: + bench.txt
2014-10-15 20:49:58vstinnersetfiles: + bench_case.py

messages: + msg229501
2014-10-15 20:48:15serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg229499
2014-10-15 20:30:45vstinnercreate