classification
Title: Consistency in unicode string multiplication with an integer
Type: performance Stage: resolved
Components: Unicode Versions: Python 3.10
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, serhiy.storchaka, syl-nktaylor, vstinner
Priority: normal Keywords:

Created on 2020-12-07 20:18 by syl-nktaylor, last changed 2020-12-07 22:06 by syl-nktaylor. This issue is now closed.

Messages (4)
msg382681 - (view) Author: syl-nktaylor (syl-nktaylor) Date: 2020-12-07 20:18
In https://github.com/python/cpython/blob/master/Objects/unicodeobject.c#L12930, unicode_repeat does string multiplication with an integer in 3 different ways:
1) one memset call, for utf-8 when string size is 1
2) linear 'for' loops, for utf-16 and utf-32 when string size is 1
3) logarithmic 'while' loop with memcpy calls, for utf-8/utf-16/utf-32 when string size > 1 

Is there a performance or correctness reason for which we can't also use the 3rd way for the second case? I realise depending on architecture, the second case could benefit from vectorization, but the memcpy calls will also be hardware optimised.

An example of using just the 1st and 3rd methods:

--- a/Objects/unicodeobject.c
+++ b/Objects/unicodeobject.c
@@ -12954,31 +12954,16 @@ unicode_repeat(PyObject *str, Py_ssize_t len)
     assert(PyUnicode_KIND(u) == PyUnicode_KIND(str));

-    if (PyUnicode_GET_LENGTH(str) == 1) {
-        int kind = PyUnicode_KIND(str);
-        Py_UCS4 fill_char = PyUnicode_READ(kind, PyUnicode_DATA(str), 0);
-        if (kind == PyUnicode_1BYTE_KIND) {
-            void *to = PyUnicode_DATA(u);
-            memset(to, (unsigned char)fill_char, len);
-        }
-        else if (kind == PyUnicode_2BYTE_KIND) {
-            Py_UCS2 *ucs2 = PyUnicode_2BYTE_DATA(u);
-            for (n = 0; n < len; ++n)
-                ucs2[n] = fill_char;
-        } else {
-            Py_UCS4 *ucs4 = PyUnicode_4BYTE_DATA(u);
-            assert(kind == PyUnicode_4BYTE_KIND);
-            for (n = 0; n < len; ++n)
-                ucs4[n] = fill_char;
-        }
-    }
-    else {
+    Py_ssize_t char_size = PyUnicode_KIND(str);
+    char *to = (char *) PyUnicode_DATA(u);
+    if (PyUnicode_GET_LENGTH(str) == 1 && char_size == PyUnicode_1BYTE_KIND) {
+        Py_UCS4 fill_char = PyUnicode_READ(char_size, PyUnicode_DATA(str), 0);
+        memset(to, fill_char, len);
+    } else {
         /* number of characters copied this far */
         Py_ssize_t done = PyUnicode_GET_LENGTH(str);
-        Py_ssize_t char_size = PyUnicode_KIND(str);
-        char *to = (char *) PyUnicode_DATA(u);
         memcpy(to, PyUnicode_DATA(str),
                   PyUnicode_GET_LENGTH(str) * char_size);
         while (done < nchars) {...
msg382682 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-12-07 20:37
BTW, CPython does not use UTF-8 and UTF-16 encoding in internal representation of strings. It uses Latin1, UCS2 and UCS4 (UTF-32).

What benchmarks show? Is your code always faster and how much? If it is slower for some data, for what data and how much?
msg382687 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-12-07 21:53
+        Py_UCS4 fill_char = PyUnicode_READ(char_size, PyUnicode_DATA(str), 0);
+        memset(to, fill_char, len);

The second parameter of memset() is a byte (8-bit "octet"). You cannot pass Py_UCS4 to memset(), it doesn't work.
msg382689 - (view) Author: syl-nktaylor (syl-nktaylor) Date: 2020-12-07 22:06
The build did seem to run, despite memset using fillchar without the explicit casting, so I assumed it did an implicit casting, but the original casting can be kept of course. With this build, my sample tests for 1-byte, 2-byte and 4-byte chars also ran ok, so again, I assumed memset did the casting.

As for the benchmarks, I haven't run them, but yes, it's likely the best way to find out the performance considerations I was asking about.
History
Date User Action Args
2020-12-07 22:06:03syl-nktaylorsetstatus: open -> closed
resolution: not a bug
messages: + msg382689

stage: resolved
2020-12-07 21:53:35vstinnersetmessages: + msg382687
2020-12-07 20:37:13serhiy.storchakasettype: behavior -> performance

messages: + msg382682
nosy: + serhiy.storchaka
2020-12-07 20:18:22syl-nktaylorcreate