Issue 42593: Consistency in unicode string multiplication with an integer

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/86759

classification

Title:	Consistency in unicode string multiplication with an integer
Type:	performance	Stage:	resolved
Components:	Unicode	Versions:	Python 3.10

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, serhiy.storchaka, syl-nktaylor, vstinner
Priority:	normal	Keywords:

Created on 2020-12-07 20:18 by syl-nktaylor, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (4)
msg382681 - (view)	Author: syl-nktaylor (syl-nktaylor)	Date: 2020-12-07 20:18
In https://github.com/python/cpython/blob/master/Objects/unicodeobject.c#L12930, unicode_repeat does string multiplication with an integer in 3 different ways: 1) one memset call, for utf-8 when string size is 1 2) linear 'for' loops, for utf-16 and utf-32 when string size is 1 3) logarithmic 'while' loop with memcpy calls, for utf-8/utf-16/utf-32 when string size > 1 Is there a performance or correctness reason for which we can't also use the 3rd way for the second case? I realise depending on architecture, the second case could benefit from vectorization, but the memcpy calls will also be hardware optimised. An example of using just the 1st and 3rd methods: --- a/Objects/unicodeobject.c +++ b/Objects/unicodeobject.c @@ -12954,31 +12954,16 @@ unicode_repeat(PyObject str, Py_ssize_t len) assert(PyUnicode_KIND(u) == PyUnicode_KIND(str)); - if (PyUnicode_GET_LENGTH(str) == 1) { - int kind = PyUnicode_KIND(str); - Py_UCS4 fill_char = PyUnicode_READ(kind, PyUnicode_DATA(str), 0); - if (kind == PyUnicode_1BYTE_KIND) { - void to = PyUnicode_DATA(u); - memset(to, (unsigned char)fill_char, len); - } - else if (kind == PyUnicode_2BYTE_KIND) { - Py_UCS2 ucs2 = PyUnicode_2BYTE_DATA(u); - for (n = 0; n < len; ++n) - ucs2[n] = fill_char; - } else { - Py_UCS4 ucs4 = PyUnicode_4BYTE_DATA(u); - assert(kind == PyUnicode_4BYTE_KIND); - for (n = 0; n < len; ++n) - ucs4[n] = fill_char; - } - } - else { + Py_ssize_t char_size = PyUnicode_KIND(str); + char to = (char ) PyUnicode_DATA(u); + if (PyUnicode_GET_LENGTH(str) == 1 && char_size == PyUnicode_1BYTE_KIND) { + Py_UCS4 fill_char = PyUnicode_READ(char_size, PyUnicode_DATA(str), 0); + memset(to, fill_char, len); + } else { /* number of characters copied this far / Py_ssize_t done = PyUnicode_GET_LENGTH(str); - Py_ssize_t char_size = PyUnicode_KIND(str); - char to = (char ) PyUnicode_DATA(u); memcpy(to, PyUnicode_DATA(str), PyUnicode_GET_LENGTH(str) char_size); while (done < nchars) {...
msg382682 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2020-12-07 20:37
BTW, CPython does not use UTF-8 and UTF-16 encoding in internal representation of strings. It uses Latin1, UCS2 and UCS4 (UTF-32). What benchmarks show? Is your code always faster and how much? If it is slower for some data, for what data and how much?
msg382687 - (view)	Author: STINNER Victor (vstinner) *	Date: 2020-12-07 21:53
+ Py_UCS4 fill_char = PyUnicode_READ(char_size, PyUnicode_DATA(str), 0); + memset(to, fill_char, len); The second parameter of memset() is a byte (8-bit "octet"). You cannot pass Py_UCS4 to memset(), it doesn't work.
msg382689 - (view)	Author: syl-nktaylor (syl-nktaylor)	Date: 2020-12-07 22:06
The build did seem to run, despite memset using fillchar without the explicit casting, so I assumed it did an implicit casting, but the original casting can be kept of course. With this build, my sample tests for 1-byte, 2-byte and 4-byte chars also ran ok, so again, I assumed memset did the casting. As for the benchmarks, I haven't run them, but yes, it's likely the best way to find out the performance considerations I was asking about.

History
Date	User	Action	Args
2022-04-11 14:59:39	admin	set	github: 86759
2020-12-07 22:06:03	syl-nktaylor	set	status: open -> closed resolution: not a bug messages: + msg382689 stage: resolved
2020-12-07 21:53:35	vstinner	set	messages: + msg382687
2020-12-07 20:37:13	serhiy.storchaka	set	type: behavior -> performance messages: + msg382682 nosy: + serhiy.storchaka
2020-12-07 20:18:22	syl-nktaylor	create