Message 302300 - Python tracker


Author	terry.reedy
Recipients	benjamin.peterson, ezio.melotti, lemburg, serhiy.storchaka, terry.reedy, vstinner
Date	2017-09-15.20:53:37
Message-id	<1505508817.11.0.544062520532.issue31484@psf.upfronthosting.co.za>
In-reply-to

Content
I looked at the Gutenberg samples.  The first has a short intro with some English, then is pure Greek.  The patch is clearly good for anyone using mostly a single-block alphabetic language.

The second is Chinese, not hieroglyphs (ancient Egyptian).  A slowdown for ancient Egyptian is irrelevant; a slowdown for Chinese is undesirable.  Japanese mostly uses about 2000 Chinese chars; Chinese uses more.  Even if the common chars are grouped together (I don't know), there are at least 10 possible chars for each 2-char slot.  So I am not surprised at a net slowdown.  I would also not be surprised if Japanese fared worse, as it uses at least 2 blocks for its kana and uses many Latin chars.
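
To make the slot-contention argument concrete, here is a rough sketch, not the patch's actual code. It assumes a hypothetical direct-mapped cache indexed by `code point % slots`, with 512 slots standing in for the "2 x 256" mentioned in this message, and it assumes code points below U+0100 are handled elsewhere. The function name `slot_pressure` and both parameters are made up for illustration.

```python
from collections import defaultdict


def slot_pressure(text, slots=512, first_uncached=0x100):
    """Average number of distinct characters competing for each occupied slot.

    Assumption: slot index is simply code point modulo the slot count.
    The real patch may index its cache differently.
    """
    by_slot = defaultdict(set)
    for ch in text:
        cp = ord(ch)
        if cp < first_uncached:
            continue  # assumed to be covered by the existing Latin-1 cache
        by_slot[cp % slots].add(ch)
    if not by_slot:
        return 0.0
    return sum(len(chars) for chars in by_slot.values()) / len(by_slot)


# A single small alphabetic block: every character gets its own slot.
greek = "".join(chr(cp) for cp in range(0x391, 0x3CA))

# 2048 consecutive CJK ideographs, a stand-in for a common-character set.
cjk = "".join(chr(cp) for cp in range(0x4E00, 0x4E00 + 2048))

print(slot_pressure(greek))
print(slot_pressure(cjk))
```

Under these assumptions the Greek sample never shares a slot, while the CJK sample has several characters fighting over every slot, so cached entries would keep evicting one another. Real text is not contiguous, so the exact ratio will differ.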

Unless we go beyond 2 x 256 slots, caching CJK is hopeless.  Have you considered limiting the caching to the blocks before the CJK blocks, up to, say, U+31BF (see https://en.wikipedia.org/wiki/Unicode_block)?  Both Japanese and Korean might then see an actual speedup.
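
One way to gauge what such a cutoff would do for a given text is to count how many of its non-Latin-1 characters fall at or below it. This is only a sketch: `CACHE_LIMIT` and `cacheable_fraction` are hypothetical names, and the U+31BF value is just the cutoff floated above, not anything the patch implements.

```python
CACHE_LIMIT = 0x31BF  # cutoff suggested in this message; an assumption, not patch behavior


def cacheable_fraction(text, limit=CACHE_LIMIT):
    """Fraction of characters above U+00FF that lie at or below the cutoff."""
    candidates = [ord(ch) for ch in text if ord(ch) > 0xFF]
    if not candidates:
        return 0.0
    return sum(cp <= limit for cp in candidates) / len(candidates)


print(cacheable_fraction("ひらがな"))  # kana only
print(cacheable_fraction("日本語"))    # CJK ideographs only
print(cacheable_fraction("かな漢字"))  # two kana, two ideographs
```

Kana sit below the cutoff and ideographs above it, so mixed Japanese text would be partly cached and the ideographs would simply bypass the cache instead of thrashing it.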

History
Date	User	Action	Args
2017-09-15 20:53:37	terry.reedy	set	recipients: + terry.reedy, lemburg, vstinner, benjamin.peterson, ezio.melotti, serhiy.storchaka
2017-09-15 20:53:37	terry.reedy	set	messageid: <1505508817.11.0.544062520532.issue31484@psf.upfronthosting.co.za>
2017-09-15 20:53:37	terry.reedy	link	issue31484 messages
2017-09-15 20:53:37	terry.reedy	create