This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author serhiy.storchaka
Recipients benjamin.peterson, ezio.melotti, lemburg, serhiy.storchaka, vstinner
Date 2017-09-15.13:29:39
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1505482179.61.0.572534667387.issue31484@psf.upfronthosting.co.za>
In-reply-to
Content
Single-character strings in the Latin1 range (U+0000 - U+00FF) are shared in CPython. This saves memory and CPU time of per-character processing of strings containing ASCII characters and characters from Latin based alphabets. But the users of languages that use non-Latin based alphabets are not so lucky. Proposed PR adds a cache for characters in BMP (U+0100 - U+FFFF) which covers most alphabetic scripts.

Most alphabets contain characters from a single 128- or 256-character block, therefore only lowest bits are used for addressing in the cache. But together with the characters from a particular alphabet it is common to use ASCII characters (spaces, newline, punctuations, digits, etc) and few Unicode punctuation (long dash, Unicode quotes, etc). Their low bits  can match low bits of letters. Therefore every index addresses not a single character, but a mini-LRU-cache of size 2. This keeps letters in a cache even if non-letters with conflicting low bits are occurred.

Microbenchmarking results.

Iterating sample non-Latin-based alphabetic text (Iliad by Homer [1]) is over 70% faster:

$ ./python -m timeit -s 's = open("36248-0.txt").read()' -- 'for c in s: pass'
Unpatched:  20 loops, best of 5: 14.5 msec per loop
Patched:    50 loops, best of 5: 8.32 msec per loop

Iterating sample hieroglyphic text (Shui Hu Zhuan by Nai an Shi [2]) is about 4% slower:

$ ./python -m timeit -s 's = open("23863-0.txt").read()' -- 'for c in s: pass'
Unpatched:  20 loops, best of 5: 11.7 msec per loop
Patched:    20 loops, best of 5: 12.1 msec per loop

Iterating a string containing non-repeated characters from the all BMP range is 20% slower:

$ ./python -m timeit -s 's = "".join(map(chr, range(0x10000)))' -- 'for c in s: pass'
Unpatched:  200 loops, best of 5: 1.39 msec per loop
Patched:    200 loops, best of 5: 1.7 msec per loop


[1] https://www.gutenberg.org/files/36248/36248-0.txt
[2] https://www.gutenberg.org/files/23863/23863-0.txt
History
Date User Action Args
2017-09-15 13:29:39serhiy.storchakasetrecipients: + serhiy.storchaka, lemburg, vstinner, benjamin.peterson, ezio.melotti
2017-09-15 13:29:39serhiy.storchakasetmessageid: <1505482179.61.0.572534667387.issue31484@psf.upfronthosting.co.za>
2017-09-15 13:29:39serhiy.storchakalinkissue31484 messages
2017-09-15 13:29:39serhiy.storchakacreate