classification
Title: Cache single-character strings outside of the Latin1 range
Type: performance Stage: resolved
Components: Interpreter Core, Unicode Versions: Python 3.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: benjamin.peterson, ezio.melotti, lemburg, methane, pitrou, serhiy.storchaka, terry.reedy, vstinner, xiang.zhang
Priority: normal Keywords: patch

Created on 2017-09-15 13:29 by serhiy.storchaka, last changed 2020-10-24 18:12 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 3600 closed serhiy.storchaka, 2017-09-15 13:32
Messages (13)
msg302253 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-15 13:29
Single-character strings in the Latin1 range (U+0000 - U+00FF) are shared in CPython. This saves memory and CPU time of per-character processing of strings containing ASCII characters and characters from Latin based alphabets. But the users of languages that use non-Latin based alphabets are not so lucky. Proposed PR adds a cache for characters in BMP (U+0100 - U+FFFF) which covers most alphabetic scripts.

Most alphabets contain characters from a single 128- or 256-character block, therefore only lowest bits are used for addressing in the cache. But together with the characters from a particular alphabet it is common to use ASCII characters (spaces, newline, punctuations, digits, etc) and few Unicode punctuation (long dash, Unicode quotes, etc). Their low bits  can match low bits of letters. Therefore every index addresses not a single character, but a mini-LRU-cache of size 2. This keeps letters in a cache even if non-letters with conflicting low bits are occurred.

Microbenchmarking results.

Iterating sample non-Latin-based alphabetic text (Iliad by Homer [1]) is over 70% faster:

$ ./python -m timeit -s 's = open("36248-0.txt").read()' -- 'for c in s: pass'
Unpatched:  20 loops, best of 5: 14.5 msec per loop
Patched:    50 loops, best of 5: 8.32 msec per loop

Iterating sample hieroglyphic text (Shui Hu Zhuan by Nai an Shi [2]) is about 4% slower:

$ ./python -m timeit -s 's = open("23863-0.txt").read()' -- 'for c in s: pass'
Unpatched:  20 loops, best of 5: 11.7 msec per loop
Patched:    20 loops, best of 5: 12.1 msec per loop

Iterating a string containing non-repeated characters from the all BMP range is 20% slower:

$ ./python -m timeit -s 's = "".join(map(chr, range(0x10000)))' -- 'for c in s: pass'
Unpatched:  200 loops, best of 5: 1.39 msec per loop
Patched:    200 loops, best of 5: 1.7 msec per loop


[1] https://www.gutenberg.org/files/36248/36248-0.txt
[2] https://www.gutenberg.org/files/23863/23863-0.txt
msg302295 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2017-09-15 19:55
Are the timings for normal builds, with, as I understand things, asserts turned off?
msg302298 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-15 20:31
Yes, of course. There is not much sense to benchmark a debug build. Tested code is many times slower in debug builds due to complex consistency checking.
msg302300 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2017-09-15 20:53
I looked at the Gutenburg samples.  The first has a short intro with some English, then is pure Greek.  The patch is clearly good for anyone using mostly a single block alphabetic language.

The second is Chinese, not hieroglyphs (ancient Egyptian).  A slowdown for ancient Egyptian is irrelevant; a slowdown for Chinese is undesirable.  Japanese mostly uses about 2000 Chinese chars, the Chinses more.  Even if the common chars are grouped together (I don't know), there are at least 10 possible chars for each 2-char slot.  So I am not surprised at a net slowdown.  I would also not be surprised if Japanese fared worse, as it uses at least 2 blocks for its kana and uses many latin chars.

Unless we go beyond 2 x 256 slots, caching CJK is hopeless.  Have you considered limiting the caching to the blocks before the CJK blocks, up to, say, U+31BF?  https://en.wikipedia.org/wiki/Unicode_block.  Both Japanese and Korean might then see an actual speedup.
msg302302 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-15 22:32
Sorry for using the word hieroglyphs for Chinese characters. I didn't know how they are named in English.

Limiting the caching to U+31BF doesn't have effect. But increasing the cache size to 2 x 512 slots makes the iteration of Chinese text faster by around 20%. Following size duplications increase the speed by additional 15-25%. The limit is 75-80%. This doesn't affect Greek.
msg302304 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2017-09-15 23:05
The Greek sample includes 155 unique characters (including whitespace, punctuation, and the english characters at the beginning), so they can all fit in the cache.
The Chinese sample however includes 3695 unique characters (all within the BMP), probably causing a lot more misses in the cache and a slowdown caused by the overhead.
The Chinese text you used for the test is also from some 700 years ago, and uses traditional and vernacular Chinese, so the number of unique character is higher than what you would normally encounter in modern Chinese.
msg302329 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-16 08:14
Initially I used 2 x 128 slots. It is enough for single block alphabetic languages. But it was caused significant slow down for Chinese. Increasing the size to 2 x 256 compensates the overhead for Chinese and restores the performance. If it is appropriate that the optimization affects only languages with small alphabets and keeps the performance for Chinese, Japan and Korean roughly unchanged (plus-minus few percents), this size is enough. I we want to optimize also processing texts with Chinese characters, it can be increased to 2 x 512 or 2 x 1024. Further increasing have smaller effect.

The cache of size 2 x 256 slots can increase memory consumption by 50 KiB in worst case, 2 x 1024 -- by 200 KiB.
msg302330 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-09-16 09:42
Interesting optimization.  But I don't know how common `for c in s`
or `s[i]` is used for Japanese text.
msg302343 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2017-09-16 14:46
If I understand correctly, anyone could change the cache size for their personal or corporate binary by changing
#define BMP_CACHE_SIZE 256

There should be a comment that it must not be 0 and should be a power of 2 at least, say, 256.
msg302358 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2017-09-17 04:26
> The cache of size 2 x 256 slots can increase memory consumption by 50 KiB in worst case, 2 x 1024 -- by 200 KiB.

How much is this compared to the total usage?

> But I don't know how common `for c in s` or `s[i]` is used for Japanese text.

I think the same applies to other languages/scripts too, so this optimization might be moot unless the cache also improves performances of other more common operations (e.g. encoding/decoding).

It would be interesting to see how this affects real-world application: if there are no regressions and the memory overhead is not too much I think we can accept the patch.
msg302362 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-17 10:22
> > The cache of size 2 x 256 slots can increase memory consumption by 50 KiB
> > in worst case, 2 x 1024 -- by 200 KiB.
> How much is this compared to the total usage?

For Python interpreter VmSize: 31784 kB, VmRSS: 7900 kB.

The cache doesn't affect performance of encode/decode since codecs are written in C. But it affects text processing written in Python. The more complex text processing code the less relative difference.

$ ./python -m perf timeit --compare-to ./python0 -s 'import html; s = open("36248-0.txt").read()' -- 'html.escape(s)'
Mean +- std dev: [python0] 2.59 ms +- 0.06 ms -> [python] 1.58 ms +- 0.04 ms: 1.64x faster (-39%)

$ ./python -m perf timeit --compare-to ./python0 -s 'import re; s = open("36248-0.txt").read()' -- 're.escape(s)'
Mean +- std dev: [python0] 45.9 ms +- 1.3 ms -> [python] 44.0 ms +- 1.4 ms: 1.04x faster (-4%)

$ ./python -m perf timeit --compare-to ./python0 -s 'import fnmatch; s = open("36248-0.txt").read()' -- 'fnmatch.translate(s)'
Mean +- std dev: [python0] 448 ms +- 9 ms -> [python] 435 ms +- 9 ms: 1.03x faster (-3%)

$ ./python -m perf timeit --compare-to ./python0 -s 'import re, sre_compile; s = re.escape(open("36248-0.txt").read())' -- 'sre_compile.compile(s)'
Mean +- std dev: [python0] 1.44 sec +- 0.02 sec -> [python] 1.41 sec +- 0.02 sec: 1.01x faster (-1%)

$ ./python -m perf timeit --compare-to ./python0 -s 'import textwrap; s = open("36248-0.txt").read()' -- 'textwrap.wrap(s)'
Mean +- std dev: [python0] 208 ms +- 2 ms -> [python] 207 ms +- 2 ms: 1.01x faster (-1%)

In real-word applications I expect hundredths and thousandths fractions of percent. This effect can't be measured in any sane way.
msg302371 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2017-09-17 16:10
I run the patch against a toy NLP application, cutting words from Shui Hu Zhuan provided by Serhiy. The result is not bad, 6% faster. And I also count the hit rate, 90% hit cell 0, 4.5 hit cell 1, 5.5% miss. I also increase the cache size to 1024 * 2. Although the hit rate increases to 95.4%, 2.1%, 2.4%, it's still 6% difference.

So IMHO this patch could hardly affect that *much* real-world applications, better or worse. I couldn't recall clearly the implementation of unicode but why can't we reuse the latin1 cache when we use this bmp cache? And then to avoid the chars' low bits conflicting with ASCII chars' low bits we have to introduce the mini-LRU-cache, which is not that easily understandable.
msg302481 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-09-18 19:11
Judging by the numbers, this optimization does not sound worth the hassle.  It is quite rare to iterate over all characters in a long text while doing so little work with them that the cost of iteration is significant.

By the way:

> Sorry for using the word hieroglyphs for Chinese characters. I didn't know how they are named in English.

I think the word you were looking for is "ideogram".
History
Date User Action Args
2020-10-24 18:12:13serhiy.storchakasetstatus: open -> closed
resolution: rejected
stage: patch review -> resolved
2017-09-18 19:11:47pitrousetnosy: + pitrou
messages: + msg302481
2017-09-17 16:10:50xiang.zhangsetmessages: + msg302371
2017-09-17 10:22:05serhiy.storchakasetmessages: + msg302362
2017-09-17 04:26:45ezio.melottisetmessages: + msg302358
2017-09-16 14:46:15terry.reedysetmessages: + msg302343
2017-09-16 09:42:20methanesetmessages: + msg302330
2017-09-16 08:14:41serhiy.storchakasetnosy: + methane, xiang.zhang
messages: + msg302329
2017-09-15 23:05:27ezio.melottisetmessages: + msg302304
2017-09-15 22:32:09serhiy.storchakasetmessages: + msg302302
2017-09-15 20:53:37terry.reedysetmessages: + msg302300
2017-09-15 20:31:03serhiy.storchakasetmessages: + msg302298
2017-09-15 19:55:56terry.reedysetnosy: + terry.reedy
messages: + msg302295
2017-09-15 13:32:53serhiy.storchakasetkeywords: + patch
pull_requests: + pull_request3592
2017-09-15 13:29:39serhiy.storchakacreate