
Author terry.reedy
Recipients benjamin.peterson, ezio.melotti, lemburg, serhiy.storchaka, terry.reedy, vstinner
Date 2017-09-15.20:53:37
Message-id <1505508817.11.0.544062520532.issue31484@psf.upfronthosting.co.za>
In-reply-to
Content
I looked at the Gutenberg samples.  The first has a short intro with some English, then is pure Greek.  The patch is clearly good for anyone using mostly a single-block alphabetic language.

The second is Chinese, not hieroglyphs (ancient Egyptian).  A slowdown for ancient Egyptian is irrelevant; a slowdown for Chinese is undesirable.  Japanese mostly uses about 2000 Chinese chars, the Chinese use more.  Even if the common chars are grouped together (I don't know), there are at least 10 possible chars for each 2-char slot.  So I am not surprised at a net slowdown.  I would also not be surprised if Japanese fared worse, as it uses at least 2 blocks for its kana and uses many Latin chars.

Unless we go beyond 2 x 256 slots, caching CJK is hopeless.  Have you considered limiting the caching to the blocks before the CJK blocks, up to, say, U+31BF (see https://en.wikipedia.org/wiki/Unicode_block)?  Both Japanese and Korean might then see an actual speedup.
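
To make the cutoff idea concrete, here is a rough simulation, not the patch itself.  It assumes a direct-mapped cache of 2 x 256 slots indexed by code point modulo the slot count; the real patch may index differently.  The names simulate, CACHE_SLOTS and CJK_CUTOFF are made up for this sketch, and code points below 256 are skipped on the assumption that they are handled separately.

```python
CACHE_SLOTS = 2 * 256   # slot count mentioned above
CJK_CUTOFF = 0x31BF     # suggested upper limit for cached code points


def simulate(text, limit=None, slots=CACHE_SLOTS):
    """Return (hits, misses, bypassed) for a toy single-char cache.

    Assumption: slot index is ``code point % slots``.  This is only an
    illustration of why a cutoff could help, not the actual patch.
    """
    cache = [None] * slots
    hits = misses = bypassed = 0
    for ch in text:
        cp = ord(ch)
        # Skip chars assumed to be handled elsewhere (below 256) and,
        # when a limit is given, chars above the cutoff.
        if cp < 256 or (limit is not None and cp > limit):
            bypassed += 1
            continue
        slot = cp % slots
        if cache[slot] == cp:
            hits += 1
        else:
            # Miss: evict whatever was in the slot and store this char.
            cache[slot] = cp
            misses += 1
    return hits, misses, bypassed


if __name__ == "__main__":
    greek = "\u03b1\u03b2\u03b3" * 2       # three Greek letters, repeated
    colliding = "\u4e00\u5000" * 2         # two CJK chars sharing one slot
    print(simulate(greek))
    print(simulate(colliding))
    print(simulate(colliding, limit=CJK_CUTOFF))
```

With no limit, two CJK chars that map to the same slot keep evicting each other, so every lookup pays the miss cost.  With the limit, they bypass the cache entirely and leave the slots to kana, hangul jamo, and other lower blocks.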