classification
Title: Use 8-byte step to detect ASCII sequence in 64bit Windows builds
Type: performance Stage: resolved
Components: Interpreter Core, Windows Versions: Python 3.10
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: malin, methane, paul.moore, serhiy.storchaka, sir-sigurd, steve.dower, tim.golden, zach.ware
Priority: normal Keywords: patch

Created on 2019-09-22 11:50 by malin, last changed 2020-10-18 16:52 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 16334 merged malin, 2019-09-23 13:30
Messages (6)
msg352970 - (view) Author: Ma Lin (malin) * Date: 2019-09-22 11:50
The C type `long` is a 4-byte integer in 64-bit Windows builds. [1]

But the `ucs1lib_find_max_char()` function [2] uses SIZEOF_LONG to choose its step size, so it scans only 4 bytes at a time and loses a little performance in 64-bit Windows builds.
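
To make the cost concrete, here is a minimal sketch (not taken from CPython) that counts how many word-sized loads a scan needs for a given step size. The names `loads_needed` and `print_step_sizes` are hypothetical, introduced only for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: number of loads needed to cover nbytes when each
   load reads `step` bytes (rounded up for a partial final load). */
size_t
loads_needed(size_t nbytes, size_t step)
{
    return (nbytes + step - 1) / step;
}

/* Prints the widths of the two candidate types on the current platform.
   On 64-bit Windows (LLP64) one would expect 4 and 8; on 64-bit Linux
   (LP64) one would expect 8 and 8. */
void
print_step_sizes(void)
{
    printf("sizeof(unsigned long) = %zu\n", sizeof(unsigned long));
    printf("sizeof(size_t)        = %zu\n", sizeof(size_t));
}
```

For a 10,000,000-byte buffer, a 4-byte step needs twice as many loads as an 8-byte step, which is where the potential saving comes from.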

Below is a benchmark after switching to SIZEOF_SIZE_T and making this change:

    -   unsigned long value = *(unsigned long *) _p;
    +   size_t value = *(size_t *) _p;

D:\dev\cpython\PCbuild\amd64\python.exe -m pyperf timeit -s "b=b'a'*10_000_000; f=b.decode;" "f('latin1')"

    before: 5.83 ms +- 0.05 ms
    after : 5.58 ms +- 0.06 ms
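
The idea behind the change can be sketched as a standalone function. This is an illustrative approximation, not CPython's code: the name `all_ascii` and the `HIGH_BITS` macro are hypothetical, and `memcpy` is used for the word load here to sidestep alignment concerns, whereas the original snippet dereferences a cast pointer.

```c
#include <stddef.h>
#include <string.h>

/* 0x80 repeated in every byte of a size_t.
   (size_t)-1 / 0xFF gives 0x01 in every byte; multiplying by 0x80
   moves that bit to the top of each byte. Assumes 8-bit bytes. */
#define HIGH_BITS (((size_t)-1 / 0xFF) * 0x80)

/* Hypothetical helper: returns 1 if all n bytes are ASCII (< 0x80), else 0. */
int
all_ascii(const unsigned char *p, size_t n)
{
    const unsigned char *end = p + n;

    /* Word-at-a-time: one test covers sizeof(size_t) bytes. */
    while ((size_t)(end - p) >= sizeof(size_t)) {
        size_t value;
        memcpy(&value, p, sizeof value);
        if (value & HIGH_BITS)
            return 0;
        p += sizeof value;
    }

    /* Remaining tail bytes, checked one by one. */
    while (p < end) {
        if (*p++ & 0x80)
            return 0;
    }
    return 1;
}
```

Because the step is `sizeof(size_t)`, the loop covers 8 bytes per test on any 64-bit build, including Windows, where `unsigned long` would cover only 4.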

[1] https://stackoverflow.com/questions/384502

[2] https://github.com/python/cpython/blob/v3.8.0b4/Objects/stringlib/find_max_char.h#L9

Maybe there can be more optimizations, so I didn't prepare a PR for this.
msg352998 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-09-23 09:24
This looks like a good idea. Would you mind creating a PR?
msg353007 - (view) Author: Ma Lin (malin) * Date: 2019-09-23 10:50
Maybe @sir-sigurd can find more optimizations.

FYI, the `_Py_bytes_isascii()` function [1] also has similar code.
[1] https://github.com/python/cpython/blob/v3.8.0b4/Objects/bytes_methods.c#L104
msg353024 - (view) Author: Ma Lin (malin) * Date: 2019-09-23 14:52
Four functions have similar code; see PR 16334.
I just replaced the `unsigned long` type with the `size_t` type and got these benchmarks.
Can this be backported to the 3.8 branch?
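
For the decoding cases, a related variant of the same pattern is finding how long the leading ASCII run is. The sketch below is an assumption about the general shape of such code, not a copy of the CPython functions touched by the PR; the name `ascii_prefix_len` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the leading run of ASCII bytes in p[0..n). */
size_t
ascii_prefix_len(const unsigned char *p, size_t n)
{
    /* 0x80 in every byte of a size_t (assumes 8-bit bytes). */
    const size_t high_bits = ((size_t)-1 / 0xFF) * 0x80;
    size_t i = 0;

    /* Skip whole words that contain no byte with the high bit set. */
    while (n - i >= sizeof(size_t)) {
        size_t value;
        memcpy(&value, p + i, sizeof value);
        if (value & high_bits)
            break;  /* a non-ASCII byte is somewhere in this word */
        i += sizeof value;
    }

    /* Locate the exact boundary (or finish the tail) byte by byte. */
    while (i < n && !(p[i] & 0x80))
        i++;
    return i;
}
```

With a `size_t` step the fast loop advances 8 bytes per iteration on 64-bit Windows instead of 4, which matches the direction of the benchmarks below.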

1.  bytes.isascii()

D:\dev\cpython\PCbuild\amd64\python.exe -m pyperf timeit -s "b = b'x' * 100_000_000; f = b.isascii;" "f()"

+-----------+-----------+------------------------------+
| Benchmark | isascii_a | isascii_b                    |
+===========+===========+==============================+
| timeit    | 11.7 ms   | 7.84 ms: 1.50x faster (-33%) |
+-----------+-----------+------------------------------+

2.  bytes.decode('latin1')

D:\dev\cpython\PCbuild\amd64\python.exe -m pyperf timeit -s "b = b'x' * 100_000_000; f = b.decode;" "f('latin1')"

+-----------+----------+-----------------------------+
| Benchmark | latin1_a | latin1_b                    |
+===========+==========+=============================+
| timeit    | 60.3 ms  | 57.4 ms: 1.05x faster (-5%) |
+-----------+----------+-----------------------------+

3.  bytes.decode('ascii')

D:\dev\cpython\PCbuild\amd64\python.exe -m pyperf timeit -s "b = b'x' * 100_000_000; f = b.decode;" "f('ascii')"

+-----------+---------+-----------------------------+
| Benchmark | ascii_a | ascii_b                     |
+===========+=========+=============================+
| timeit    | 48.5 ms | 47.1 ms: 1.03x faster (-3%) |
+-----------+---------+-----------------------------+

4.  bytes.decode('utf8')

D:\dev\cpython\PCbuild\amd64\python.exe -m pyperf timeit -s "b = b'x' * 100_000_000; f = b.decode;" "f('utf8')"

+-----------+---------+-----------------------------+
| Benchmark | utf8_a  | utf8_b                      |
+===========+=========+=============================+
| timeit    | 48.3 ms | 47.1 ms: 1.03x faster (-3%) |
+-----------+---------+-----------------------------+
msg378801 - (view) Author: Ma Lin (malin) * Date: 2020-10-17 04:48
Although the improvement is not great, it's a very hot code path.

Could you review the PR?
msg378869 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-10-18 14:48
New changeset a0c603cb9d4dbb9909979313a88bcd1f5fde4f62 by Ma Lin in branch 'master':
bpo-38252: Use 8-byte step to detect ASCII sequence in 64bit Windows build (GH-16334)
https://github.com/python/cpython/commit/a0c603cb9d4dbb9909979313a88bcd1f5fde4f62
History
Date User Action Args
2020-10-18 16:52:39 serhiy.storchaka set status: open -> closed
resolution: fixed
stage: patch review -> resolved
2020-10-18 14:48:46 serhiy.storchaka set messages: + msg378869
2020-10-17 04:48:27 malin set nosy: + paul.moore, tim.golden
messages: + msg378801
components: + Windows
2020-06-14 11:23:50 cheryl.sabella set nosy: + zach.ware, steve.dower
versions: + Python 3.10, - Python 3.9
2019-09-23 14:52:22 malin set messages: + msg353024
2019-09-23 14:00:13 malin set title: micro-optimize ucs1lib_find_max_char in Windows 64-bit build -> Use 8-byte step to detect ASCII sequence in 64bit Windows builds
2019-09-23 13:30:30 malin set keywords: + patch
stage: patch review
pull_requests: + pull_request15911
2019-09-23 10:50:05 malin set messages: + msg353007
2019-09-23 09:24:12 serhiy.storchaka set messages: + msg352998
2019-09-22 11:50:02 malin create