classification
Title: Optimize PyUnicode_FromString for short ASCII strings
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: methane, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2019-06-20 11:40 by methane, last changed 2019-06-24 07:08 by methane. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 14264 closed methane, 2019-06-20 11:41
PR 14273 closed methane, 2019-06-20 16:00
PR 14283 merged methane, 2019-06-21 10:28
PR 14291 closed methane, 2019-06-21 16:53
Messages (15)
msg346115 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-06-20 11:40
_PyUnicode_FromASCII(s, len) is faster than PyUnicode_FromString(s) because PyUnicode_FromString() uses temporary _PyUnicodeWriter to support UTF-8.

But _PyUnicode_FromASCII("hello", strlen("hello"))` is not as easy as `PyUnicode_FromString("hello")`.

_PyUnicode_FROM_ASCII() is simple macro which wraps _PyUnicode_FromASCII which calls strlen() automatically:

```
/* Convenient wrapper for _PyUnicode_FromASCII */
#define _PyUnicode_FROM_ASCII(s) _PyUnicode_FromASCII((s), strlen(s))
```

I believe recent compilers optimize away calls of strlen().
msg346116 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-20 11:58
> #define _PyUnicode_FROM_ASCII(s) _PyUnicode_FromASCII((s), strlen(s))

LGTM.
msg346117 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-06-20 12:27
I confirmed at least one measurable speedup:

./python -m pyperf timeit -s 'd={}' -- 'repr(d)'

FromString: Mean +- std dev: 157 ns +- 3 ns
FROM_ASCII: Mean +- std dev: 132 ns +- 3 ns

>>> (157-132)/157
0.1592356687898089
msg346118 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-06-20 12:39
Should we make these APIs public?

* rename `_PyUnicode_FromASCII` to `PyUnicode_FromASCIIAndSize`.
* add `PyUnicode_FromASCII` as static inline function (without ABI).
* add `#define _PyUnicode_FromASCII PyUnicode_FromASCIIAndSize` for source code compatibility.
msg346121 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-06-20 14:17
Most of changes are in not performance sensitive code. I do not think there is a benefit of using new macro there.

If PyUnicode_FromString() is slow I prefer to optimize it instead of adding yet one esoteric private function for internal needs.

In case of dict.__repr__() we can get even more gain by using _Py_IDENTIFIER or more general API proposed by Victor.
msg346127 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-06-20 16:03
> Most of changes are in not performance sensitive code. I do not think there is a benefit of using new macro there.

Because I used just sed.

> If PyUnicode_FromString() is slow I prefer to optimize it instead of adding yet one esoteric private function for internal needs.

OK.  There are some code like `PyUnicode_FromString(name)`.  Optimizing PyUnicode_FromString will be more effective.  I created PR 14273.


> In case of dict.__repr__() we can get even more gain by using _Py_IDENTIFIER or more general API proposed by Victor.

Of course, I used it just for micro benchmarking.  Optimizing it is not a goal.  In case of PR 14273:

$ ./python -m pyperf timeit -s 'd={}' -- 'repr(d)'
.....................
Mean +- std dev: 138 ns +- 2 ns
msg346132 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-06-20 16:32
Oh, wait.  Why we used _PyUnicodeWriter here?
Decoding UTF-8 must not require it.  2-pass is enough.
msg346134 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-20 16:33
> _PyUnicode_FromASCII(s, len) is faster than PyUnicode_FromString(s) because PyUnicode_FromString() uses temporary _PyUnicodeWriter to support UTF-8.

I don't understand how _PyUnicodeWriter could be slow. It does not overallocate by default. It's just wrapper to implement efficient memory management.

> Oh, wait.  Why we used _PyUnicodeWriter here?

To optimize decoding errors: the error handler can use replacement string longer than 1 character. Overallocation is used in this case.
msg346202 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-06-21 10:30
> I don't understand how _PyUnicodeWriter could be slow. It does not overallocate by default. It's just wrapper to implement efficient memory management.

I misunderstood _PyUnicodeWriter.  I thought it caused one more allocation, but it doesn't.

But _PyUnicodeWriter is still slow, because gcc and clang are not smart enough to optimize _PyUnicodeWriter_Init() & _PyUnicodeWriter_Prepare().

See this example:

```
#define PY_SSIZE_T_CLEAN
#include <Python.h>

#define S(s) (s),strlen(s)

int
main(int argc, char *argv[])
{
    Py_Initialize();

    for (int i=0; i<100000000; i++) {
        //PyObject *s = PyUnicode_FromString("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
        PyObject *s = _PyUnicode_FromASCII(S("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"));
        Py_DECREF(s);
    }
    return 0;
}
```

PyUnicode_FromString() takes about 4 sec on my machine.  _PyUnicode_FromASCII() is about 2 sec.
By skipping _PyUnicodeWriter for ASCII string (GH-14283), PyUnicode_FromString() takes about 3 sec.

```
$ time ./x  # PyUnicode_FromString

real    0m4.085s
user    0m4.081s
sys     0m0.004s

$ time ./y  # PyUnicode_FromString (skip _PyUnicode_Writer, GH-14283)

real    0m2.988s
user    0m2.988s
sys     0m0.000s

$ time ./z  # _PyUnicode_FromASCII
$ time ./z

real    0m1.975s
user    0m1.975s
sys     0m0.000s
```
msg346203 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-06-21 10:37
Another micro benchmark:

```
$ ./python-master -m pyperf timeit -o m1.json 'b=b"foobar"' -- 'b.decode()'
.....................
Mean +- std dev: 93.1 ns +- 2.4 ns

$ ./python -m pyperf timeit -o m2.json 'b=b"foobar"' -- 'b.decode()'
.....................
Mean +- std dev: 83.1 ns +- 2.6 ns

$ ./python -m pyperf compare_to m1.json m2.json
Mean +- std dev: [m1] 93.1 ns +- 2.4 ns -> [m2] 83.1 ns +- 2.6 ns: 1.12x faster (-11%)
```
msg346237 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-06-21 17:10
PR 14291 seems simpler than PR 14283.  But PR 14283 is faster because _PyUnicodeWriter is a learge struct.

master: 3.7sec
PR 14283: 2.9sec
PR 14291: 3.45sec

compiler: gcc (Ubuntu 8.3.0-6ubuntu1) 8.3.0
without LTO, without PGO
msg346247 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-21 21:03
I'm confused by the issue title: PyUnicode_GetString() doesn't exist, it's PyUnicode_FromString() :-) I changed the title.
msg346351 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-06-24 03:30
New changeset 770847a7db33b3d4c451b42372b6942687aa6121 by Inada Naoki in branch 'master':
bpo-37348: optimize decoding ASCII string (GH-14283)
https://github.com/python/cpython/commit/770847a7db33b3d4c451b42372b6942687aa6121
msg346354 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-06-24 06:13
Could you please measure the performance for long strings (1000, 10000 and 100000 characters): a long ASCII string and a long ASCII string ending with a non-ASCII character?
msg346356 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-06-24 07:08
This optimization is only for short strings.  There is no significant difference for long and non-ASCII strings.

```
# 1000 ASCII
$ ./python -m pyperf timeit --compare-to=./python-master -s 'b=b"f"*1000' -- 'b.decode()'
python-master: ..................... 196 ns +- 1 ns
python: ..................... 185 ns +- 1 ns

Mean +- std dev: [python-master] 196 ns +- 1 ns -> [python] 185 ns +- 1 ns: 1.06x faster (-6%)

# 10000 ASCII
$ ./python -m pyperf timeit --compare-to=./python-master -s 'b=b"f"*10000' -- 'b.decode()'
python-master: ..................... 671 ns +- 4 ns
python: ..................... 662 ns +- 1 ns

Mean +- std dev: [python-master] 671 ns +- 4 ns -> [python] 662 ns +- 1 ns: 1.01x faster (-1%)

# 100000 ASCII
$ ./python -m pyperf timeit --compare-to=./python-master -s 'b=b"f"*100000' -- 'b.decode()'
python-master: ..................... 5.86 us +- 0.03 us
python: ..................... 5.85 us +- 0.02 us

Mean +- std dev: [python-master] 5.86 us +- 0.03 us -> [python] 5.85 us +- 0.02 us: 1.00x faster (-0%)

# 1 non-ASCII
$ ./python -m pyperf timeit --compare-to=./python-master -s 'b="あ".encode()' -- 'b.decode()'
python-master: ..................... 138 ns +- 2 ns
python: ..................... 136 ns +- 2 ns

Mean +- std dev: [python-master] 138 ns +- 2 ns -> [python] 136 ns +- 2 ns: 1.02x faster (-2%)

# 1000 ASCII + 1 non-ASCII
$ ./python -m pyperf timeit --compare-to=./python-master -s 'b=b"x"*1000 + "あ".encode()' -- 'b.decode()'
python-master: ..................... 361 ns +- 9 ns
python: ..................... 360 ns +- 5 ns

Mean +- std dev: [python-master] 361 ns +- 9 ns -> [python] 360 ns +- 5 ns: 1.00x faster (-0%)
Not significant!

# 10000 ASCII + 1 non-ASCII
$ ./python -m pyperf timeit --compare-to=./python-master -s 'b=b"x"*10000 + "あ".encode()' -- 'b.decode()'
python-master: ..................... 2.83 us +- 0.02 us
python: ..................... 2.83 us +- 0.03 us

Mean +- std dev: [python-master] 2.83 us +- 0.02 us -> [python] 2.83 us +- 0.03 us: 1.00x slower (+0%)
Not significant!
```
History
Date User Action Args
2019-06-24 07:08:11methanesetmessages: + msg346356
2019-06-24 06:13:28serhiy.storchakasetmessages: + msg346354
2019-06-24 03:30:43methanesetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2019-06-24 03:30:30methanesetmessages: + msg346351
2019-06-21 21:03:51vstinnersetmessages: + msg346247
title: Optimize PyUnicode_GetString for short ASCII strings -> Optimize PyUnicode_FromString for short ASCII strings
2019-06-21 17:10:02methanesetmessages: + msg346237
2019-06-21 16:53:37methanesetpull_requests: + pull_request14114
2019-06-21 10:37:24methanesetmessages: + msg346203
2019-06-21 10:30:07methanesetmessages: + msg346202
2019-06-21 10:28:36methanesetpull_requests: + pull_request14105
2019-06-20 16:33:54vstinnersetmessages: + msg346134
2019-06-20 16:32:09methanesetmessages: + msg346132
2019-06-20 16:03:34methanesetmessages: + msg346127
title: add _PyUnicode_FROM_ASCII macro -> Optimize PyUnicode_GetString for short ASCII strings
2019-06-20 16:00:33methanesetpull_requests: + pull_request14097
2019-06-20 14:17:35serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg346121
2019-06-20 12:39:17methanesetmessages: + msg346118
2019-06-20 12:27:06methanesetmessages: + msg346117
2019-06-20 11:58:17vstinnersetnosy: + vstinner
messages: + msg346116
2019-06-20 11:41:20methanesetkeywords: + patch
stage: patch review
pull_requests: + pull_request14091
2019-06-20 11:40:57methanecreate